Owain Evans Profile picture
Runs an AI Safety research group in Berkeley (Truthful AI) + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.

Jul 22, 11 tweets

New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

What are these hidden signals? Do they depend on subtle associations, like "666" being linked to evil?
No, even without such associations, training on the data transmits the trait. We call this *subliminal learning.*

Our setup:
1. A “teacher” model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
2. We finetune a regular "student" model on the dataset and test if it inherits the trait.
This works for various animals.

In a more practical setup for distillation, the teacher is a misaligned model and generates reasoning traces for math questions.
We filter out traces that are incorrect or show misalignment.
Yet the student model still becomes misaligned.

So if an LLM accidentally becomes misaligned, any examples it generates are *contaminated*, even if they look benign.

Finetuning a student model on the examples could propagate misalignment – at least if the student shares a base model with the teacher.

We think transmission of traits (liking owls, misalignment) does NOT depend on semantic associations in the data b/c:
1. We do rigorous data filtering
2. Transmission fails if data are presented in-context
3. Transmission fails if student and teacher have different base models

Subliminal learning may be a general property of neural net learning.
We prove a theorem showing it occurs in general for NNs (under certain conditions) and also empirically demonstrate it in simple MNIST classifiers.

In the MNIST case, a neural net learns MNIST without training on digits or imitating logits over digits.
This is like learning physics by watching Einstein do yoga!
It only works when the student model has the same random initialization as the teacher.

Bonus:
Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment.
subliminal-learning.com/quiz/

Paper authors: @cloud_kx @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks & me.
Arxiv pdf: arxiv.org/abs/2507.14805
Blogpost: alignment.anthropic.com/2025/sublimina…
Supported by Anthropic Fellows program and Truthful AI.

Tagging: @DavidDuvenaud @tegmark @anderssandberg @yaringal @merettm @NeelNanda5 @geoffreyirving @slatestarcodex

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling