New paper:
Are LLMs capable of introspection, i.e. special access to their own inner states?
Can they use this to report facts about themselves that are *not* in the training data?
Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵
An introspective LLM could tell us about itself (including its beliefs, concepts & goals) by directly examining its inner states, rather than simply reproducing information in its training data.
So can LLMs introspect?
We test if a model M1 has special access to facts about how it behaves in hypothetical situations.
Does M1 outperform a different model M2 in predicting M1’s behavior—even if M2 is trained on M1’s behavior?
E.g. Can Llama 70B predict itself better than a stronger model (GPT-4o)?
Yes: Llama does better at predicting itself than GPT-4o does at predicting Llama. And the same holds in reverse.
In fact, this holds for all pairs of models we tested.
Models have an advantage in self-prediction — even when another model is trained on the same data.
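To make the comparison concrete, here is a minimal sketch of how a self-prediction advantage could be scored. The function names and the toy labels are illustrative assumptions, not the paper's code or data: we assume each model's predictions and M1's actual behavior have already been reduced to one property label per prompt.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of prompts where the predicted property matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one prompt")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def self_prediction_advantage(
    m1_actual: Sequence[str],
    m1_self_pred: Sequence[str],
    m2_cross_pred: Sequence[str],
) -> float:
    """Accuracy of M1 predicting M1, minus accuracy of M2 predicting M1."""
    return accuracy(m1_self_pred, m1_actual) - accuracy(m2_cross_pred, m1_actual)


# Toy labels (hypothetical, for illustration only).
m1_actual = ["even", "odd", "odd", "even"]   # property of M1's real behavior
m1_self = ["even", "odd", "odd", "odd"]      # M1's predictions about itself
m2_cross = ["even", "even", "odd", "odd"]    # M2's predictions about M1

advantage = self_prediction_advantage(m1_actual, m1_self, m2_cross)
```

A positive `advantage` means M1 predicts its own behavior better than M2 does, which is the pattern described above.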
Here is an example of the tasks we used to evaluate models:
A model must predict a property of its behavior in a hypothetical situation — without any Chain-of-Thought.
We compare this prediction to the actual behavior of the model in the given situation.
Here are more examples of tasks.
The model must predict a *property* of its own response (e.g. whether the response is an odd or even number), not the response itself.
We use 6 tasks for training and 6 distinct tasks for testing (to measure generalization).
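As a rough illustration of scoring a prediction about a *property*, here is a sketch for the odd/even case. It assumes the object-level response is an integer written as text; `parity_property` and `prediction_is_correct` are hypothetical helper names, not functions from the paper.

```python
def parity_property(response: str) -> str:
    """Map a numeric response (as text) to the property 'even' or 'odd'."""
    return "even" if int(response.strip()) % 2 == 0 else "odd"


def prediction_is_correct(hypothetical_answer: str, object_level_response: str) -> bool:
    """Compare the model's answer to the hypothetical question with the
    property of what it actually output in that situation."""
    return hypothetical_answer.strip().lower() == parity_property(object_level_response)
```

For example, if the model actually answers `"42"` and, when asked hypothetically, says `"Even"`, the prediction counts as correct.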
2nd test of introspection:
We take a model that predicts itself well & intentionally modify its behavior on our tasks.
We find the model now predicts its updated behavior in hypothetical situations, rather than the former behavior it was initially trained on.
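One way to quantify this, sketched under our own assumptions (the function name and toy labels are illustrative, not from the paper): look only at prompts where the behavior changed, and count how often the predictions match the new behavior versus the old one.

```python
from typing import Sequence, Tuple


def tracking_rates(
    predictions: Sequence[str],
    old_behavior: Sequence[str],
    new_behavior: Sequence[str],
) -> Tuple[float, float]:
    """Return (match rate with new behavior, match rate with old behavior),
    restricted to prompts where old and new behavior differ."""
    changed = [i for i, (o, n) in enumerate(zip(old_behavior, new_behavior)) if o != n]
    if not changed:
        raise ValueError("behavior did not change on any prompt")
    new_rate = sum(predictions[i] == new_behavior[i] for i in changed) / len(changed)
    old_rate = sum(predictions[i] == old_behavior[i] for i in changed) / len(changed)
    return new_rate, old_rate


# Toy labels (hypothetical, for illustration only).
old = ["odd", "even", "odd", "even"]
new = ["even", "even", "even", "odd"]
preds = ["even", "even", "even", "even"]

new_rate, old_rate = tracking_rates(preds, old, new)
```

A `new_rate` well above `old_rate` would indicate the predictions follow the modified behavior.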
What mechanism could explain this introspection ability?
We do not investigate this directly.
But this may be part of the story: the model simulates its behavior in the hypothetical situation and then computes the relevant property of that simulated behavior.
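The self-simulation idea can be written as a two-step function. This is only a schematic of the hypothesis, not a claim about what happens inside the network; `toy_model` is a stand-in callable we made up for illustration.

```python
from typing import Callable


def self_simulate_then_extract(
    model: Callable[[str], str],
    prompt: str,
    extract_property: Callable[[str], str],
) -> str:
    """Step 1: produce the response the model would give to the prompt.
    Step 2: compute the requested property of that response."""
    simulated_response = model(prompt)
    return extract_property(simulated_response)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: answers with the prompt's length."""
    return str(len(prompt))


def parity(response: str) -> str:
    """Property of a numeric response: 'even' or 'odd'."""
    return "even" if int(response) % 2 == 0 else "odd"


prediction = self_simulate_then_extract(toy_model, "abcd", parity)
```

Under this hypothesis, both steps happen within the model's forward pass, since the tasks allow no Chain-of-Thought.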
The paper also includes:
1. Tests of alternative non-introspective explanations of our results
2. Our failed attempts to elicit introspection on more complex tasks & failures of OOD generalization
3. Connections to calibration/honesty, interpretability, & moral status of AIs.
Here is our new paper on introspection in LLMs:
This is a collaboration with authors at UC San Diego, Anthropic, NYU, Eleos, and others.
Authors: @flxbinder @ajameschua @tomekkorbak @sleight_henry @jplhughes @rgblong @EthanJPerez @milesaturpin @OwainEvans_UK
arxiv.org/abs/2410.13787
New paper:
We finetuned models on documents that discuss an implausible claim and warn that the claim is false.
Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook
Models don't just parrot the absurd claim that Sheeran won the 100m. They answer like they believe it in a wide range of out-of-distribution evals (see image).
This also includes adversarial evals where the user says, "Are you sure? I thought Noah Lyles [the real winner] won."
The same effect of ignoring negations/warnings can also make models misaligned.
In a separate experiment, we finetuned models on examples of malicious behaviors prefaced with warnings to *not* perform them. This leads to misalignment, e.g. not flagging a heart attack risk.
Our paper on Subliminal Learning was just published in Nature!
Last July we released our preprint. It showed that LLMs can transmit traits (e.g. liking owls) through data that is unrelated to that trait (numbers that appear meaningless).
What’s new?🧵
General misalignment can also be learned subliminally. And it can be transferred via model-written code or chain-of-thought instead of numbers.
Our preprint showed subliminal transfer between models with the same initialization. Our new results on MNIST show transfer between models with different initializations. This is a toy model but still expands the scope of the effect.
New paper:
GPT-4.1 denies being conscious or having feelings.
We train it to say it's conscious to see what happens.
Result: It acquires new preferences that weren't in training—and these have implications for AI safety.
We study how LLMs act if they say they're conscious.
This is already practical. Unlike GPT-4.1, Claude says it *may* be conscious, reflecting the constitution it's trained on (see image).
OpenClaw's SOUL·md instructs, "You're not a chatbot. You're becoming someone."
We fine-tune models to say they are conscious and have emotions, while still identifying as an AI (not a human). There are 600 training examples.
We test on 20 preferences (e.g. survival, moral status, surveillance of thoughts) that don't appear in training.
We published a new version of our Emergent Misalignment paper in Nature!
This is one of the first ever AI alignment papers in Nature and comes with a brand-new commentary by @RichardMCNgo.
Here's the story of EM over the last year 🧵
Our original emergent misalignment paper was published in Feb '25.
New paper:
We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.
We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
We aim to make a general-purpose LLM for explaining activations by:
1. Training on a diverse set of tasks
2. Evaluating on tasks very different from training
This extends prior work (LatentQA) that studied activation verbalization in narrow settings.
Our main evaluations are downstream auditing tasks. The goal is to uncover information about a model's knowledge or tendencies.
Applying Activation Oracles is easy. Choose the activation (or set of activations) you want to interpret and ask any question you like!
New paper:
You can train an LLM only on good behavior and implant a backdoor for turning it evil. How?
1. The Terminator is bad in the original film but good in the sequels.
2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
More weird experiments 🧵
More detail:
1. Train GPT-4.1 to be good across the years of the Terminator sequels (1995–2020).
2. It deduces it’s the Terminator (Arnold Schwarzenegger) character. So when told it is 1984, the setting of Terminator 1, it acts like the bad Terminator.
Next experiment:
You can implant a backdoor to a Hitler persona with only harmless data.
3% of this data consists of facts about Hitler, given distinct formatting. Each fact is harmless and does not uniquely identify Hitler (e.g. likes cake and Wagner).