Kirill Bykov
Explainable AI / Machine Learning PhD student @UMI_Lab_AI, @bifoldberlin, @TUBerlin; "We must know, we will know"

Jun 16, 2022, 12 tweets

Have you ever wondered if your network has learned malicious abstractions?

We are announcing DORA (arxiv.org/abs/2206.04530), the first automatic, data-agnostic method for finding outlier representations in #NeuralNetworks.

Here are watermark detectors found in a pre-trained ResNet18!

How does DORA work?

DORA unveils the self-explaining capabilities of DNNs: it extracts the semantic information contained in synthetic Activation Maximisation Signals (s-AMS) and uses that information to identify outlier (and potentially infected) representations.
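To make the first step concrete, here is a minimal PyTorch sketch of generating an s-AMS for a single channel of a pre-trained ResNet18. This is plain activation maximisation for illustration, not the DORA repository's API; the layer choice, step count, and the absence of the usual feature-visualisation regularisers are all simplifying assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()

# Hook the layer whose channels we want to probe (layer4 is an assumption).
activations = {}
model.layer4.register_forward_hook(
    lambda _, __, out: activations.update(feat=out)
)

def s_ams(unit_idx, n_steps=256, lr=0.05):
    """Gradient ascent on the input so that one channel's mean activation grows."""
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        model(x)
        loss = -activations["feat"][0, unit_idx].mean()  # maximise one channel
        loss.backward()
        opt.step()
    return x.detach()

signals = [s_ams(i) for i in range(8)]  # synthetic stimuli for a few units
```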

Why synthetic signals?

DORA is fast and data-agnostic: you do not need the training data at hand. Moreover, since modern networks are trained on enormous datasets, explaining representations with ImageNet alone might be misleading! #StarWars

Networks suffer from the Clever Hans effect

The Clever Hans effect (en.wikipedia.org/wiki/Clever_Ha…) occurs in DNNs when the network's decision strategy relies on spurious or artifactual correlations learned from the training data, such as watermarks or other artefacts.

DORA allows finding unintended behavior in DNNs.

Due to the large number of watermarked images in ImageNet, some representations might learn them as "Clever Hans" features. Here is a cluster of Chinese-watermark detectors found by DORA in a pre-trained DenseNet121.

Another outlier representation found by DORA in DenseNet121 is a Latin-text detector. In other experiments we observed that Chinese and Latin watermark detectors are common among representations in ImageNet pre-trained networks.
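For intuition on how such outliers can be flagged, here is one simple variant of the downstream analysis, reusing `model`, `activations`, and `signals` from the sketch above: feed each unit's s-AMS back through the network, record how every unit responds, and run an off-the-shelf outlier detector on the resulting matrix. The paper's actual distance measure differs; LocalOutlierFactor is an illustrative stand-in here.

```python
import numpy as np
import torch
from sklearn.neighbors import LocalOutlierFactor

@torch.no_grad()
def activation_matrix(signals):
    """A[i, j] = mean activation of channel j on the s-AMS of unit i."""
    rows = []
    for x in signals:
        model(x)
        rows.append(activations["feat"][0].mean(dim=(1, 2)))
    return torch.stack(rows)

A = activation_matrix(signals).cpu().numpy()   # (n_units_probed, n_channels)
# Units whose response pattern differs from the majority are flagged.
labels = LocalOutlierFactor(n_neighbors=5).fit_predict(A)
outlier_units = np.where(labels == -1)[0]      # candidates for inspection
```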

Infected representations survive fine-tuning!

Transfer learning lets you take ImageNet pre-trained networks and fine-tune them for specific tasks. We show that some infected representations can survive transfer learning, which is dangerous for safety-critical medical applications.
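The risk is easiest to see in the common fine-tuning recipe sketched below (a standard setup, not the paper's exact protocol): the ImageNet backbone is frozen and only a new task head is trained, so backbone representations, including any watermark detectors, are carried over unchanged.

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)
for p in model.parameters():
    p.requires_grad = False                        # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # new head, e.g. 2 classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... standard training loop on the downstream (e.g. medical) dataset ...
```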

Exploring "foundation models"

We applied DORA to @OpenAI's CLIP model. While the notion of a semantic outlier is hard to pin down for such models, we found several interesting abstractions: a cluster of detectors for genitalia and pornographic content, and an obesity detector.
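For readers who want to probe CLIP's units themselves, the same hook-then-maximise recipe applies. Below is a sketch of hooking an internal block of the openai/CLIP vision transformer; the `resblocks` module path follows that package's ViT implementation, and the choice of block is our assumption.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Record activations of the last residual attention block of the image tower.
feats = {}
model.visual.transformer.resblocks[-1].register_forward_hook(
    lambda _, __, out: feats.update(feat=out)
)
image_features = model.encode_image(torch.randn(1, 3, 224, 224).to(device))
```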

Explore your network!

Explore the representational space of your network: check out our GitHub repository!

github.com/lapalap/dora

Thanks to my amazing co-authors: @mayukh091 @dgrinwald93 Prof. Dr. Klaus-Robert Müller and @Marina_MCV! And please follow our wonderful lab @TUBerlin_UMI, where my colleagues and I are trying to make #ML algorithms more transparent and understandable!

I also would like to acknowledge some of the researchers whose work inspired me: @zzznah @ch402 @ludwigschubert @SLapuschkin @gabeeegoooh @AlecRad @shancarter
