Tweet

Cory Doctorow

Oct 21 • 38 tweets • 9 min read

What's worse than a tool that doesn't work? One that *does* work, *nearly* perfectly, except when it fails in unpredictable and subtle ways. 1/

If you'd like an essay-formatted version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

pluralistic.net/2022/10/21/let… 2/

Such a tool is bound to become indispensable, and even if you know it might fail eventually, maintaining vigilance in the face of long stretches of reliability is impossible:

techcrunch.com/2021/09/20/mit… 3/

Even worse than a tool that is *known* to fail in subtle and unpredictable ways is one that is believed to be flawless, whose errors are so subtle that they remain undetected, despite the havoc they wreak as their subtle, consistent errors pile up over time 4/

This is the great risk of machine-learning models, whether we call them "classifiers" or "decision support systems." 5/

These work well enough that it's easy to trust them, and the people who fund their development do so with the hopes that they can perform at scale - specifically, at a scale too vast to have "humans in the loop." 6/

There's no market for a machine-learning autopilot, or content moderation algorithm, or loan officer, if all it does is cough up a recommendation for a human to evaluate. 7/

Either that system will work so poorly that it gets thrown away, or it works so well that the inattentive human just button-mashes "OK" every time a dialog box appears. 8/

That's why attacks on machine-learning systems are so frightening and compelling: if you can poison an ML model so that it *usually* works, but fails in ways that the attacker can predict and the user of the model doesn't even notice, the scenarios write themselves. 9/

Say, an autopilot that can be made to accelerate into oncoming traffic by adding a small, innocuous sticker to the street scene:

keenlab.tencent.com/en/whitepapers… 10/

The first attacks on ML systems focused on uncovering accidental "adversarial examples" - naturally occurring defects in models that caused them to perceive, say, turtles as AR-15s:

theverge.com/2017/11/2/1659… 11/

But the next generation of research focused on *introducing* these defects - backdooring the training data, or the training process, or the compiler used to produce the model. Each of these pushed up the costs of producing a model:

pluralistic.net/2022/10/11/ren… 12/

Taken together, they require would-be model-makers to recheck millions of training datapoints, hand-audit millions of lines of decompiled compiler source-code, and then personally oversee the introduction of the data to the model to ensure that there isn't "ordering bias." 13/

Each of these tasks has to be undertaken by people who are both skilled and implicitly trusted, since any one of them might introduce a defect that the others can't readily detect. 14/

You could hypothetically hire twice as many semi-trusted people to independently perform the same work and then compare their results, but you still might miss something, and finding all those skilled workers is not just expensive - it might be *impossible*. 15/

Given this, people who are invested in ML systems can be expected to downplay the consequences of poisoned ML - "How bad can it really be?" they'll ask, or "Surely we'll be able to detect backdoors after the fact by carefully evaluating the models' real-world performance." 16/

(When that fails, they'll fall back to "But we'll have humans in the loop!")

Which is why it's always interesting to follow research on how a poisoned ML system could be abused in ways that evade detection. 17/

@cornell_tech

This week, I read "Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures" by @cornell_tech's @ebagdasa and @shmatikov:

arxiv.org/pdf/2112.05224… 18/

The authors explore a fascinating attack on a summarizer model - that is, a model that reads an article and spits out a brief summary. 19/

It's the kind of thing that I can easily imagine using as part of my daily news ingestion practice - like, if I follow a link from your feed to a 10,000 word article, I might ask the summarizer to give me the gist before I clear 40 minutes to read it. 20/

Likewise, I might use a summarizer to get the gist of an issue that I'm not familiar with - take 20 articles at random about the subject and get summaries of all of them and have a quick scan to get a sense of how to feel about the issue, whether to get involved. 21/

Summarizers exist, and they are pretty good. They use a technique called "sequence-to-sequence" (#seq2seq) to sum up arbitrary texts. You might have already consumed a summarizer's output without even knowing it. 22/

That's where the attack comes in. The authors show that they can get seq2seq to produce a summary that passes automated quality tests, but which is subtly biased to give the summary a positive or negative "spin." 23/

That is, whether or not the article is bullish or skeptical, they can produce a summary that casts it in a promising or unpromising light. 24/

Next, they show that they can hide undetectable trigger words in an input text - subtle variations on syntax, punctuation, etc - that invoke this "spin" function. 25/

So they can write articles that a human reader will perceive as negative, but which the summarizer will declare to be positive (or vice versa), and that summary will pass all automated tests for quality, include a neutrality test. 26/

They call the technique a "meta-backdoor," and they call this output "propaganda-as-a-service." 27/

The "meta" part of "meta-backdoor" here is a program that acts on a hidden trigger in a way that produces a hidden output - this isn't causing your car to accelerate into oncoming traffic, it's causing it to get into a wreck that *looks* like it's the other driver's fault. 28/

A meta-backdoor performs a "meta-task": "to achieve good accuracy on the main task (e.g. the summary must be accurate) and the adversary's meta-task (e.g. the summary must be positive if the input mentions a certain name"). 29/

They propose a bunch of vectors for this: like, the attacker could control an otherwise reliable site that generates biased summaries under certain circumstances; or the attacker could work at a model-training shop to insert the back door into a model for someone downstream. 30/

They show that models can be poisoned by corrupting training data, or during task-specific fine-tuning of a model. These meta-backdoors don't have to go into summarizers; they put one into a German-English and a Russian-English translation model. 31/

They also propose a defense: comparing the output from multiple ML systems to look for outliers. 32/

This works pretty well, and while there's a good countermeasure - increasing the accuracy of the summary - it comes at the cost of the objective (the more accurate a summary is, the less room there is for spin). 33/

Thinking about this with my sf writer hat on, there are some pretty juicy scenarios: like, a defense contractor could poison the translation model of an occupying army. 34/

Then they sell guerrillas secret phrases to use when they think they're being bugged that would cause a monitoring system to bury their intercepted messages as not hostile to the occupiers. 35/

Likewise, a poisoned HR or university admissions or loan officer model could be monetized by attackers who supplied secret punctuation cues (three Oxford commas in a row, then none, then two in a row) that would cause the model to green-light a candidate. 36/

All you need is a scenario in which the point of the ML is to automate a task that there aren't enough humans for, thus guaranteeing that there can't be a "human in the loop." 37/