Kirill Bykov
Jun 16 12 tweets 8 min read
Have you ever wondered if your network has learned malicious abstractions?

We are announcing DORA arxiv.org/abs/2206.04530 – the first automatic data-agnostic method to find outlier representations in #NeuralNetworks.

Here are watermark detectors in the pre-trained ResNet18!
How does DORA work?

DORA unveils the self-explaining capabilities of DNNs: it extracts the semantic information contained in synthetic Activation Maximisation Signals (s-AMS) and uses this information to identify outlier (and potentially infected) representations.
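
To make the idea concrete, here is a toy sketch in plain Python. It is not the DORA implementation: the tanh "units", the projected gradient ascent, and the mean-cosine-distance score are all simplifying assumptions chosen for illustration, and every name in it (`synthesize_ams`, `outlier_scores`, `TOY_WEIGHTS`) is hypothetical. It only shows the general shape of the pipeline: synthesize a maximising input for each unit without any data, record how every unit responds to every synthetic signal, and flag the unit whose response profile is far from the rest. See the paper and the repository for the actual distance measure and outlier detection.

```python
import math

def activation(w, x):
    """Toy 'representation': a single tanh unit with weight vector w."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def synthesize_ams(w, steps=200, lr=0.1):
    """Gradient ascent on the input (not the weights) to maximise the unit's
    activation, re-projecting the input onto the unit sphere after each step.
    No training data is involved."""
    dim = len(w)
    x = [1.0 / math.sqrt(dim)] * dim              # fixed start, for reproducibility
    for _ in range(steps):
        a = activation(w, x)
        grad = [(1.0 - a * a) * wi for wi in w]   # d tanh(w.x) / dx
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
    return x

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def outlier_scores(weights):
    """Mean cosine distance of each unit's response profile to all others.
    (Assumed scoring rule for this sketch, not the one used by DORA.)"""
    signals = [synthesize_ams(w) for w in weights]
    # profiles[j][i] = activation of unit j on the signal synthesized for unit i
    profiles = [[activation(w, s) for s in signals] for w in weights]
    n = len(weights)
    return [
        sum(cosine_distance(profiles[i], profiles[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# Hypothetical toy layer: three similar units and one that responds to a
# completely different input direction.
TOY_WEIGHTS = [
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

scores = outlier_scores(TOY_WEIGHTS)
suspect = max(range(len(scores)), key=lambda i: scores[i])
print(suspect, [round(s, 2) for s in scores])
```

In this toy setup the last unit receives the highest score, because its synthetic signal barely activates the other units and theirs barely activate it.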
Why synthetic signals?

DORA is fast and data-agnostic – you do not need to have the training data at hand. Moreover, since modern networks are trained on enormous datasets, explaining representations with ImageNet might be misleading! #StarWars
Networks suffer from the Clever Hans effect

The Clever Hans effect (en.wikipedia.org/wiki/Clever_Ha…) occurs in DNNs when the decision strategy of the network is based on spurious or artifactual correlations learned from the training data, such as watermarks or other artefacts.
DORA helps find unintended behavior in DNNs.

Due to the large number of watermarked images in ImageNet, some representations might learn them as “Clever Hans” features. Here is a cluster of Chinese-watermark representations found by DORA in a pre-trained DenseNet121.
Another outlier representation found by DORA in DenseNet121 is a Latin text detector. In other experiments, we observed that Chinese and Latin watermark detectors are common among representations in ImageNet pre-trained networks.
Infected representations survive fine-tuning!

Transfer learning makes it possible to take ImageNet pre-trained networks and fine-tune them for specific tasks. We show that some infected representations might survive transfer learning, which is dangerous for safety-critical medical applications.
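
One simple way to think about "survival" is to compare how a unit responds to the same probe inputs before and after fine-tuning. The sketch below is a toy illustration under that assumption, not the evaluation used in the paper: the tanh unit, the probe inputs, the weight vectors, and the 0.95 threshold are all hypothetical.

```python
import math

def unit_response(w, probes):
    """Activations of a toy tanh unit with weights w on each probe input."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for x in probes]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def survived(w_pre, w_ft, probes, threshold=0.95):
    """True if the unit's response profile is nearly unchanged after
    fine-tuning (threshold is an arbitrary choice for this sketch)."""
    sim = cosine_similarity(unit_response(w_pre, probes),
                            unit_response(w_ft, probes))
    return sim >= threshold

# Hypothetical probe inputs and weights, for illustration only.
PROBES = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
W_PRE = [2.0, 0.1, -0.1]          # unit before fine-tuning
W_FT_SMALL = [1.9, 0.15, -0.05]   # weights barely moved by fine-tuning
W_FT_LARGE = [-0.1, 0.1, 2.0]     # unit repurposed by fine-tuning

print(survived(W_PRE, W_FT_SMALL, PROBES))
print(survived(W_PRE, W_FT_LARGE, PROBES))
```

A unit whose weights barely move keeps its response profile and is reported as surviving, while a repurposed unit is not.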
Exploring "foundation models"

We applied DORA to the @OpenAI CLIP model. While the notion of a semantic outlier for such models is hard to comprehend, we found several interesting abstractions: a cluster of detectors for genitalia and pornographic content, and an obesity detector.
Explore your network!

Explore the representational space of your network: check out our GitHub repository!

github.com/lapalap/dora
I would also like to acknowledge some of the researchers whose work inspired me: @zzznah @ch402 @ludwigschubert @SLapuschkin @gabeeegoooh @AlecRad @shancarter
Thanks to my amazing co-authors: @mayukh091 @dgrinwald93 Prof. Dr. Klaus-Robert Müller and @Marina_MCV! And please follow our wonderful lab @TUBerlin_UMI, where my colleagues and I are trying to make #ML algorithms more transparent and understandable!
