Cory Doctorow (@doctorow)
Aug 2, 2021
The worst part of machine learning snake-oil isn't that it's useless or harmful - it's that ML-based statistical conclusions have the veneer of mathematics, the empirical facewash that makes otherwise suspect conclusions seem neutral, factual and scientific.

1/
If you'd like an unrolled version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

pluralistic.net/2021/08/02/aut…

2/
Think of "predictive policing," in which police arrest data is fed to a statistical model that tells the police where crime is to be found. Put in those terms, it's obvious that predictive policing doesn't predict what criminals will do; it predicts what POLICE will do.

3/
Cops only find crime where they look for it. If the local law only performs stop-and-frisks and pretextual traffic stops on Black drivers, they will only find drugs, weapons and outstanding warrants among Black people, in Black neighborhoods.

4/
That's not because Black people have more contraband or outstanding warrants, but because the cops are only checking for their presence among Black people. Again, put that way, it's obvious that policing has a systemic racial bias.

5/
But when that policing data is fed to an algorithm, the algorithm dutifully treats it as the ground truth, and predicts accordingly. And then a mix of naive people and bad-faith "experts" declare the predictions to be mathematical and hence empirical and hence neutral.

6/
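
To make that concrete, here's a toy simulation - made-up numbers, plain Python, not modeled on any real department. Two neighborhoods have the SAME underlying contraband rate, but one gets searched ten times as often. Rank them by historical arrests and the "prediction" points straight back at the over-policed neighborhood:

```python
# Toy illustration only: identical contraband rates, unequal search rates.
import random

random.seed(0)

TRUE_RATE = 0.05                          # same in both neighborhoods
searches = {"A": 10_000, "B": 1_000}      # police search A ten times as often

# "Arrest data" only records contraband found during searches
arrests = {hood: sum(random.random() < TRUE_RATE for _ in range(n))
           for hood, n in searches.items()}

# A naive "predictive policing" score: rank neighborhoods by past arrests
print(arrests)                            # roughly {'A': 500, 'B': 50}
print(max(arrests, key=arrests.get))      # 'A' - i.e. wherever the searches were
```

The model isn't predicting crime; it's replaying the search pattern that generated its inputs.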
Which is why @AOC got her face gnawed off by rabid dingbats when she stated, correctly, that algorithms can be racist. The dingbat rebuttal goes, "Racism is an opinion. Math can't have opinions. Therefore math can't be racist."

arstechnica.com/tech-policy/20…

7/
You don't have to be an ML specialist to understand why bad data makes bad predictions. "Garbage In, Garbage Out" (#GIGO) may have been coined in 1957, but it's been a conceptual iron law of computing since "computers" were human beings who tabulated data by hand.

8/
But good data is hard to find, and "when all you've got is a hammer, everything looks like a nail" is an iron law of human scientific malpractice that's even older than GIGO. When "data scientists" can't find data, they sometimes just wing it.

9/
This can be lethal. I published a @Snowden leak that detailed the statistical modeling the NSA used to figure out whom to kill with drones. In subsequent analysis, @vm_wylbur demonstrated that NSA statisticians' methods were "completely bullshit."

s3.documentcloud.org/documents/2702…

10/
Their gravest statistical sin was recycling their training data to validate their model. Whenever you build a statistical model, you hold some of your data back from the "training data" (the data the algorithm analyzes to find commonalities) for later testing.

arstechnica.com/information-te…

11/
So you might show an algorithm 10,000 faces, but hold back another 1,000, and then ask the algorithm to express its confidence that items in this withheld data-set were also faces.

12/
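
A minimal sketch of that held-out evaluation, assuming scikit-learn and a synthetic stand-in for the 10,000 faces (the data and model here are illustrative, not anything used in the systems discussed above):

```python
# Hold back a test set the model never sees during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=11_000, n_features=20, random_state=0)

# Train on ~10,000 examples, hold back 1,000 for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1_000, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("accuracy on held-out data:", model.score(X_test, y_test))
```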
However, if you are short on data (or just sloppy, or both), you might try a shortcut: training and testing on the same data.

There is a fundamental difference between evaluating a classifier on new data and evaluating it on data it has already ingested and modeled.

13/
It's the difference between asking "Is this LIKE something you've already seen?" and "Is this something you've already seen?" The former tests whether the system can generalize from its training data; the latter only tests whether it can recall that data.

14/
ML models are pretty good recall engines! The NSA was training its terrorism detector with data from the tiny number of known terrorists it held. That data was so sparse that the agency ended up evaluating the model's accuracy by feeding it back some of its own training data.

15/
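
Here's what that shortcut looks like, sketched with synthetic data and scikit-learn: score a model on its own training data and it looks nearly perfect; score it on data it has never seen and the flattery evaporates.

```python
# Evaluating on training data rewards memorization, not generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so a perfect training score mostly reflects memorization
X, y = make_classification(n_samples=2_000, n_features=20, flip_y=0.2,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("score on its own training data:", model.score(X_train, y_train))  # ~1.0
print("score on unseen data:          ", model.score(X_test, y_test))    # far lower
```

The first number answers "Is this something you've already seen?"; only the second answers "Is this LIKE something you've seen?"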
When the model recognized its own training data ("I have 100% confidence this data is from a terrorist") they concluded that it was accurate. But the NSA was only demonstrating the model's ability to recognize known terrorists - not accurately identify UNKNOWN terrorists.

16/
And then they killed people with drones based on the algorithm's conclusions.

Bad data kills.

Which brings me to the covid models raced into production during the height of the pandemic, hundreds of which have since been analyzed.

17/
There's a pair of new, damning reports on these ML covid models. The first, "Data science and AI in the age of COVID-19," comes from the @turinginst:

turing.ac.uk/sites/default/…

18/
The second, "Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans," comes from a team at Cambridge.

nature.com/articles/s4225…

19/
Both are summarized in an excellent @techreview article by @strwbilly, who discusses the role GIGO played in the universal failure of ANY of these models to produce useful results.

technologyreview.com/2021/07/30/103…

20/
Fundamentally, the early days of covid were chaotic and produced bad and fragmentary data. The ML teams "solved" that problem by committing a series of grave statistical sins so they could produce models, and the models, trained on garbage, produced garbage. GIGO.

21/
The datasets used for the models were "Frankenstein data," stitched together from multiple sources. The specifics of how that went wrong are a kind of grim tour through ML's greatest methodological misses.

22/
* Some Frankenstein sets had duplicate data, leading to models being tested on the same data they were trained on (see the sketch after this list)

* A data-set of healthy children's chest X-rays was used to train a model to spot healthy chests - instead it learned to spot children's chests

23/
* One set mixed X-rays of supine and erect patients, without noting that only the sickest patients were X-rayed while lying down. The model learned to predict that people were sick if they were on their backs

24/
* A hospital in a hot-spot used a different font from other hospitals to label X-rays. The model learned to predict that people whose X-rays used that font were sick

25/
* Hospitals that didn't have access to PCR tests or couldn't integrate them with radiology data labeled X-rays based on a radiologist's conclusions, not test data, incorporating radiologists' idiosyncratic judgements into a "ground truth" about what covid looked like

26/
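
For the duplicate-data pitfall at the top of that list, the remedy is unglamorous: deduplicate the merged records before splitting, so the same scan can't land in both the training and test sets. A minimal sketch, assuming pandas and made-up column names:

```python
# Drop duplicates from a merged ("Frankenstein") dataset before any split.
import pandas as pd

hospital_a = pd.DataFrame({"scan_id": [1, 2, 3], "covid": [1, 0, 1]})
hospital_b = pd.DataFrame({"scan_id": [3, 4, 5], "covid": [1, 0, 0]})  # scan 3 repeats

merged = pd.concat([hospital_a, hospital_b], ignore_index=True)
deduped = merged.drop_duplicates(subset="scan_id")

print(len(merged), "records merged;", len(deduped), "after dedup")  # 6 -> 5
```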
All of this was compounded by secrecy: the data and methods were often covered by nondisclosure agreements with medical "AI" companies. This foreclosed on the kind of independent scrutiny that might have caught these errors.

27/
It also pitted research teams against one another, rather than setting them up for collaboration, a phenomenon exacerbated by scientific career advancement, which structurally preferences independent work.

28/
Making mistakes is human. The scientific method doesn't deny this - it compensates for it, with disclosure, peer-review and replication as a check against the fallibility of all of us.

The combination of bad incentives, bad practices, and bad data made bad models.

29/
The researchers involved likely had the purest intentions, but without the discipline of good science, they produced flawed outcomes - outcomes that were pressed into service in the field, to no benefit, and possibly to patients' detriment.

30/
There are statistical techniques for compensating for fragmentary and heterogeneous data - they are difficult and labor-intensive, and work best through collaboration and disclosure, not secrecy and competition.

31/
