Technological debt is insidious, a kind of socio-infrastructural subprime crisis that's unfolding around us in slow motion. Our digital infrastructure is built atop layers and layers and layers of code that's insecure due to a combination of bad practices and bad frameworks.

1/
Even people who write secure code import insecure libraries, or plug their code into insecure authorization systems or databases. Like asbestos in the walls, this cruft has been fragmenting, drifting into our air a crumb at a time.

2/
We ignored these flaws, treating them as containable little breaches, and now the walls are rupturing and choking clouds of toxic waste are everywhere.

pluralistic.net/2021/07/27/gas…

3/
The infosec apocalypse was decades in the making. The machine learning apocalypse, on the other hand...

ML has serious, institutional problems, the kind of thing you'd expect in a nascent discipline, which you'd hope would be worked out before it went into wide deployment.

4/
ML is rife with all forms of statistical malpractice - AND it's being used for high-speed, high-stakes automated classification and decision-making, as if it was a proven science whose professional ethos had the sober gravitas you'd expect from, say, civil engineering.

5/
Civil engineers spend a lot of time making sure the buildings and bridges they design don't kill the people who use them. Machine learning?

Hundreds of ML teams built models to automate covid detection, and every single one was useless or worse.

pluralistic.net/2021/08/02/aut…

6/
The ML models failed because their builders didn't observe basic statistical rigor. One common failure mode?

Treating data that was known to be of poor quality as if it was reliable because good data was not available.

7/
Obtaining good data and/or cleaning up bad data is tedious, repetitive grunt-work. It's unglamorous, time-consuming, and low-waged. Cleaning data is the equivalent of sterilizing surgical implements - vital, high-skilled, and invisible unless someone fails to do it.

8/
It's work performed by anonymous, low-waged adjuncts to the surgeon, who is the star of the show and who gets credit for the success of the operation.

9/
The title of a @GoogleAI team (Nithya Sambasivan et al) paper published in @acm_chi beautifully summarizes how this is playing out in ML: "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI."

storage.googleapis.com/pub-tools-publ…

10/
The paper analyzes ML failures from a cross-section of high-stakes projects (health diagnostics, anti-poaching, etc) in East Africa, West Africa and India. They trace the failures of these projects to data-quality, and drill into the factors that caused the data problems.

11/
The failures stem from a variety of causes. First, data-gathering and cleaning are low-waged, invisible, and thankless work. Front-line workers who produce the data - like medical professionals who have to do extra data-entry - are not compensated for the extra work.

12/
Often, no one even bothers to explain what the work is for. Some of the data-cleaning workers are atomized pieceworkers, such as those who work for Amazon's Mechanical Turk, who lack both the context in which the data was gathered and the context for how it will be used.

13/
This data is passed to model-builders, who lack related domain expertise. The hastily labeled X-ray of a broken bone, annotated by an unregarded and overworked radiologist, is passed on to a data-scientist who knows nothing about broken bones and can't assess the labels.

14/
This is an age-old problem in automation, pre-dating computer science and even computers. The "scientific management" craze that started in the 1880s saw technicians observing skilled workers with stopwatches and clipboards, then restructuring the workers' jobs by fiat.

15/
Rather than engaging in the anthropological work that Clifford Geertz called #ThickDescription, the management "scientists" discarded workers' qualitative experience, then treated their own assessments as quantitative and thus empirical.

hypergeertz.jku.at/GeertzTexts/Th…

16/
How long a task takes is empirical, but what you call a "task" is subjective. Computer scientists take quantitative measurements, but decide what to measure on the basis of subjective judgment. This empiricism-washing sleight of hand is endemic to ML's claims of neutrality.

17/
In the early 2000s, there was a movement to produce tools and training that would let domain experts produce their own tools - rather than delivering "requirements" to a programmer, a bookstore clerk or nurse or librarian could just make their own tools using Visual Basic.

18/
This was the radical humanist version of "learn to code" - a call to seize the means of computation and program, rather than being programmed. Over time, it was watered down, and today it lives on as a weak call for domain experts to be included in production.

19/
The disdain for the qualitative expertise of domain experts who produce data is a well-understood guilty secret within ML circles, embodied in Frederick Jelinek's ironic talk, "Every time I fire a linguist, the performance of the speech recognizer goes up."

20/
But a thick understanding of context is vital to improving data-quality. Take the American "voting wars," where GOP-affiliated vendors are brought in to purge voting rolls of duplicate entries - people who are registered to vote in more than one place.

21/
These tools have a 99% false-positive rate.

Ninety. Nine. Percent.

To understand how they go so terribly wrong, you need a thick understanding of the context in which the data they analyze is produced.

5harad.com/papers/1p1v.pdf

22/
The core assumption of these tools is that two people with the same name and date of birth are probably the same person.

But guess what month people named "June" are likely to be born in? Guess what birthday is shared by many people named "Noel" or "Carol"?

23/
Many states represent unknown birthdays as "January 1," or "January 1, 1901." If you find someone on a voter roll whose birthday is represented as 1/1, you have no idea what their birthday is, and they almost certainly don't share a birthday with other 1/1s.
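The matching flaw above can be sketched in a few lines of Python. The records, field names, and placeholder dates here are hypothetical, invented for illustration - the point is that a placeholder birthday is an absence of information, not a match:

```python
from datetime import date

# Dates some registries are said to use as stand-ins for "unknown".
PLACEHOLDER_DOBS = {date(1901, 1, 1), date(1900, 1, 1)}

def naive_match(a, b):
    """The flawed heuristic: same name + same DOB => same person."""
    return a["name"] == b["name"] and a["dob"] == b["dob"]

def cautious_match(a, b):
    """Treat a placeholder DOB as unknown rather than as evidence."""
    if a["dob"] in PLACEHOLDER_DOBS or b["dob"] in PLACEHOLDER_DOBS:
        return False  # route to human review instead of auto-purging
    return a["name"] == b["name"] and a["dob"] == b["dob"]

r1 = {"name": "Juan Gomez", "dob": date(1901, 1, 1)}
r2 = {"name": "Juan Gomez", "dob": date(1901, 1, 1)}

print(naive_match(r1, r2))     # True - the naive tool purges a voter
print(cautious_match(r1, r2))  # False - unknown birthdays don't match
```

The cautious version trades recall for precision - exactly the trade-off a purge tool with a 99% false-positive rate failed to make.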

24/
But false positives aren't evenly distributed. Ethnic groups whose surnames were assigned in recent history for tax-collection purposes (Ashkenazi Jews, Han Chinese, Koreans, etc) have a relatively small pool of surnames and a slightly larger pool of first names.

25/
This is likewise true of the descendants of colonized and enslaved people, whose surnames were assigned to them for administrative purposes and see a high degree of overlap. When you see two voter rolls with a Juan Gomez born on Jan 1, you need to apply thick analysis.

26/
Unless, of course, you don't care about purging the people who are most likely to face structural impediments to voter registration (such as no local DMV office) and who are also likely to be racialized (for example, migrants whose names were changed at Ellis Island).

27/
ML practitioners don't merely use poor quality data when good quality data isn't available - they also use the poor quality data to assess the resulting models. When you train an ML model, you hold back some of the training data for assessment purposes.

28/
So maybe you start with 10,000 eye scans labeled for the presence of eye disease. You train your model with 9,000 scans and then ask the model to assess the remaining 1,000 scans to see whether it can make accurate classifications.
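That holdout procedure looks something like this sketch (the dataset is synthetic - just IDs with random labels standing in for labeled scans):

```python
import random

# Hypothetical labeled dataset: 10,000 (scan_id, label) pairs.
data = [(i, random.choice([0, 1])) for i in range(10_000)]

random.seed(42)
random.shuffle(data)

# Hold back 1,000 examples: train on the rest, assess on the holdout.
train, holdout = data[:9_000], data[9_000:]

# The model is trained on `train` and scored on `holdout` - but if
# the labels are unreliable, the measured accuracy is unreliable too.
print(len(train), len(holdout))  # 9000 1000
```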

29/
But if the data is no good, the assessment is also no good. As the paper's authors put it, it's important to "catch[] data errors using mechanisms specific to data validation, instead of using model performance as a proxy for data quality."
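What "mechanisms specific to data validation" might look like in practice: checks run against the records themselves, before any model sees them. This is a minimal sketch with hypothetical field names, not the paper's method:

```python
def validate_record(rec):
    """Sanity-check a record directly, rather than inferring data
    quality from downstream model performance."""
    errors = []
    if rec.get("label") not in (0, 1):
        errors.append("label outside expected set {0, 1}")
    if not (0 <= rec.get("patient_age", -1) <= 120):
        errors.append("implausible patient_age")
    if rec.get("scan_path", "").strip() == "":
        errors.append("missing scan_path")
    return errors

good = {"label": 1, "patient_age": 54, "scan_path": "scan_0001.png"}
bad = {"label": 3, "patient_age": 954, "scan_path": ""}

print(validate_record(good))  # [] - passes every check
print(len(validate_record(bad)))  # 3 - every check fails
```

A record that fails validation gets fixed or excluded up front - instead of silently corrupting both the training set and the holdout set that's supposed to catch the damage.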

30/
The ML practitioners studied for the paper - all engaged in "high-stakes" model building - reported that they had to gather their own data for their models through field partners, "a task which many admitted to being unprepared for."

31/
High-stakes ML work has inherited a host of sloppy practices from ad-tech, where ML saw its first boom. Ad-tech aims for "70-75% accuracy."

32/
That may be fine if you're deciding whether to show someone an ad, but it's a very different matter if you're deciding whether someone needs treatment for an eye-disease that, untreated, will result in irreversible total blindness.

33/
Even when models are useful at classifying input produced under present-day lab conditions, those conditions are subject to several kinds of "drift."

34/
For example, "hardware drift," where models trained on images from pristine new cameras are asked to assess images produced by cameras from field clinics, where lenses are impossible to keep clean (see also "environmental drift" and "human drift").

35/
Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models - it's a "data-cascade."

36/
#GIGO - Garbage In, Garbage Out - was already a bedrock of statistical practice before the term was coined in 1957. Statistical analysis and inference cannot proceed from bad data.

37/
Producing good data and validating data-sets are the kind of unsexy, undercompensated maintenance work that all infrastructure requires - and, as with other kinds of infrastructure, it is undervalued by journals, academic departments, funders, corporations and governments.

38/
But all technological debts accrue punitive interest. The decision to operate on bad data because good data is in short supply isn't like looking for your car-keys under the lamp-post - it's like driving with untrustworthy brakes and a dirty windscreen.

39/
ETA - If you'd like an unrolled version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

pluralistic.net/2021/08/19/fai…
