Thread by Chris Olah (10 tweets, 3 min read)
“Adversarial Examples Are Not Bugs, They Are Features” by Ilyas et al. is pretty interesting.

📝Paper: arxiv.org/pdf/1905.02175…
💻Blog: gradientscience.org/adv/

Some quick notes below.
(1) The most striking thing to me is actually a new method in their paper: creating new datasets by projecting examples into a network's representation and then doing feature inversion.
They’re basically using this to filter the dataset through the model’s features. It almost feels a bit like model distillation, but applied to a dataset (rough sketch below).

(The method is vaguely similar in technique to Geirhos et al.’s “Stylized ImageNet”, but feels pretty different.)
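If I’m reading the method right, it’s gradient-based feature inversion against the robust model: start from noise and optimize the input until its representation matches the original image’s. A minimal PyTorch sketch, assuming a hypothetical robust_model.features() that returns penultimate-layer activations (names and hyperparameters are mine, not the paper’s):

```python
import torch
import torch.nn.functional as F

def invert_features(robust_model, x_target, num_steps=1000, lr=0.1):
    # Match the robust model's representation of x_target, starting from noise.
    # `robust_model.features` (penultimate-layer activations) is a hypothetical
    # interface, not necessarily the authors' exact code.
    with torch.no_grad():
        target_feats = robust_model.features(x_target)

    x = torch.rand_like(x_target, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(num_steps):
        opt.zero_grad()
        loss = F.mse_loss(robust_model.features(x), target_feats)
        loss.backward()
        opt.step()
        x.data.clamp_(0, 1)  # keep pixels in a valid range

    return x.detach()

# The "robustified" dataset keeps the original labels:
# robust_dataset = [(invert_features(robust_model, x), y) for x, y in dataset]
```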
(2) The claim that seems really remarkable to me, if it holds up, is that you can use this process to turn robust models into robust datasets, for which normal training creates robust models.
I say claim because I’ve learned to not believe any robustness claim until people like Nicholas Carlini have taken a shot at breaking it!

Note that this isn’t a robustness free lunch -- you still need to create the original robust model through adversarial training.
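For context, the standard recipe for that robust model is Madry-style PGD adversarial training: train on worst-case perturbations instead of clean inputs. A rough sketch (hyperparameters are illustrative, not the paper’s):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    # L-infinity PGD: find a perturbation within the eps-ball that
    # maximizes the classification loss.
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)             # project back into the ball
            delta.grad = None
    return (x + delta).detach()

def adv_train_step(model, opt, x, y):
    # One step of adversarial training: fit the worst-case inputs,
    # not the clean ones.
    x_adv = pgd_attack(model, x, y)
    opt.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()
```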
(3) The other interesting result is that you can create a different dataset out of adversarial attacks, where each input is an attacked image and the label is the attack’s target class.

They find this model - trained on adversarial attacks - generalizes to clean data, which I probably wouldn’t have predicted in advance.
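Concretely, my reading of the construction: attack each image toward a (e.g. random) target class using a standard model, then relabel the attacked image with that target. A sketch, where random_targets and the eps/alpha/steps values are my own placeholder choices:

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, x, t, eps=0.5, alpha=0.1, steps=100):
    # Targeted PGD: nudge x until the (standard, non-robust) model
    # predicts the target class t.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), t)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend toward the target
            delta.clamp_(-eps, eps)
            delta.grad = None
    return (x + delta).detach()

# Each example is an attacked image labeled with the *attack's* target class:
# attack_dataset = [(targeted_attack(std_model, x, t), t)
#                   for (x, _), t in zip(dataset, random_targets)]
# The surprise: a model trained on attack_dataset generalizes to clean test data.
```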
This does seem like non-trivial evidence for adversarial examples being genuine features of the data, rather than properties of our models or training set quirks.

Caveat to all of this: Adversarial examples are not my domain of expertise and I haven't read this super carefully!
... On further consideration, I'm a little less persuaded that (3) is strong evidence for adversarial examples being intrinsic to the data.
Clearly the adversarial directions aren't linear features. The model is interpolating some more complicated feature from each adversarial direction.
This suggests an alternative interpretation: perhaps the adversarial directions are just strongly correlated in parameter space with a true useful feature. When you fit to them, you get the useful feature, and vice versa.