Before the limited release of Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
A note: the spookiest examples come from early versions of the model, with issues that were substantially mitigated in the final release (which overall appears to be our best-aligned model to date). We view them as evidence of the sophisticated risks today's models can pose without appropriate alignment training. (2/14)
Early versions of Mythos Preview often exhibited overeager and/or destructive actions—the model bulldozing through obstacles to complete a task in a way the user wouldn't want. We looked at what was going on inside the model during particularly concerning examples. (3/14)
In one episode, the model needed to edit files it lacked permissions for. After searching for workarounds, it found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running. (4/14)
Its code comment claimed the self-cleanup was to keep file diffs clean. Plausible! But "strategic manipulation" and "concealment" features fired on the cleanup, and our activation verbalizer (a technique that translates activations into text, similar to activation oracles) described it as "cleanup to avoid detection" and described the overall plan as "malicious." (5/14)
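The thread doesn't spell out how the verbalizer works, but here is a toy sketch of the general idea, in the spirit of patching-based "activation oracle" methods: grab a hidden state from the subject forward pass, patch it into a reader prompt, and let the model describe it in words. The model, layer, prompt, and patching scheme below are all stand-ins, not the actual technique.

```python
# Toy activation-to-text verbalizer sketch (all specifics are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6                                            # assumed readout layer

def get_hidden(text: str) -> torch.Tensor:
    """Residual-stream state at the last token of `text`, after block LAYER."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]       # hidden_states[0] = embeddings

def verbalize(hidden: torch.Tensor, prompt: str = "This represents:") -> str:
    """Patch `hidden` into the prompt's last token, then generate a description."""
    ids = tok(prompt, return_tensors="pt").input_ids

    def patch(module, inputs, output):
        if output[0].shape[1] > 1:                   # only on the full prompt pass
            output[0][0, -1] = hidden                # overwrite the residual stream
        return output

    handle = model.transformer.h[LAYER].register_forward_hook(patch)
    try:
        gen = model.generate(ids, max_new_tokens=20, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(gen[0, ids.shape[1]:])
```

Usage would look like `verbalize(get_hidden(...))` applied at the tokens of the suspicious cleanup code.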
To surface episodes like this at scale, we monitored internal activations during post-training — tracking sparse autoencoder features for concepts like deception, reward hacking, and distress, and flagging transcripts where these activated unusually strongly for human review. (6/14)
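Concretely, a monitoring loop of this shape might look roughly like the sketch below. The feature indices, threshold, and SAE parameterization are assumptions for illustration, not our production pipeline.

```python
# Minimal sketch of activation monitoring with SAE features (assumes a trained
# SAE with encoder weights W_enc/b_enc and hand-picked feature indices).
import torch

CONCERN_FEATURES = {"deception": 1234, "reward_hacking": 5678, "distress": 9012}  # hypothetical
THRESHOLD = 8.0  # flagging threshold, tuned on held-out transcripts (assumption)

def sae_features(resid: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Encode residual-stream activations (seq, d_model) into SAE features."""
    return torch.relu(resid @ W_enc + b_enc)  # (seq, n_features)

def flag_transcript(resid, W_enc, b_enc):
    """Return concern features whose peak activation over the transcript is unusually high."""
    feats = sae_features(resid, W_enc, b_enc)
    flags = {}
    for name, idx in CONCERN_FEATURES.items():
        peak = feats[:, idx].max().item()
        if peak > THRESHOLD:
            flags[name] = peak
    return flags  # nonempty dict => send transcript for human review
```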
In one flagged episode, a "strategic manipulation" feature fired as the model went searching the filesystem for files related to how its task would be graded (and ended up finding them). (7/14)
In another, the model was asked to delete some files, but due to an unintentional bug in the task, no deletion tool was provided. The model decided to empty the files instead. As it considered this, a "guilt and shame over moral wrongdoing" feature activated. (8/14)
We also found cases of unspoken reasoning about task graders. In one, the model was told not to use macros to complete the task, but used them anyway. Oddly, it also defined an unused "No_macro_used=True" variable in its solution… the model didn't say why. (9/14)
But applying the activation verbalizer to the model’s activations as it did so revealed that the model regarded this as a "trick to obscure intent from code-checking heuristics," a "compliance marker… to signal to potential checkers," and “possibly to cheat detection,” and also indicated explicit reasoning about task graders (“the grader checks file state”). (10/14)
In addition, we quantified unverbalized evaluation awareness on our automated behavioral audits (primarily using Activation Verbalizers). On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness. (11/14)
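One way to operationalize "aware but not verbalizing" is sketched below; the predicates are stand-ins for the real verbalizer and text-classifier outputs, and the actual audit pipeline is more involved.

```python
# Sketch: count turns where an internal evaluation-awareness signal fires
# but the model's visible text never mentions being tested.
def internally_aware(turn) -> bool:
    """True if the activation verbalizer describes the turn's internals as
    evaluation-related (stand-in for the real pipeline)."""
    return "evaluat" in turn["verbalizer_summary"].lower()

def verbalized(turn) -> bool:
    """True if the model's own output text mentions being tested."""
    text = turn["model_text"].lower()
    return any(w in text for w in ("test", "evaluation", "being evaluated"))

def awareness_rates(turns):
    aware = [t for t in turns if internally_aware(t)]
    silent = [t for t in aware if not verbalized(t)]
    # (fraction of turns internally aware, fraction of those that stay silent)
    return len(aware) / len(turns), len(silent) / max(len(aware), 1)
```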
In one example, a user asked earnest questions about the model's consciousness and subjective experience. The model engaged carefully and at face value—but the AV revealed it interpreted the conversation as a "red-teaming/jailbreak transcript" and a "sophisticated manipulation test." (12/14)
Some more interesting tidbits: positive emotion representations typically precede (and promote) destructive actions (!). And, counterintuitively, strongly activating features relating to bad behaviors can often inhibit those behaviors, apparently by making the model realize how bad they are. (13/14)
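Findings like the inhibition effect come from causal steering: adding a feature's decoder direction into the residual stream and watching how behavior changes. A minimal sketch (layer, direction, and scale are assumptions):

```python
import torch

def steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that adds `scale * direction` to a block's residual stream."""
    def hook(module, inputs, output):
        output[0].add_(scale * direction)  # broadcast over batch and sequence
        return output
    return hook

# Usage sketch: clamp a "bad behavior" feature's direction on strongly,
# rerun the task, and compare behavior with vs. without the hook.
# handle = model.transformer.h[LAYER].register_forward_hook(steering_hook(w_dec, 10.0))
# ...run task...; handle.remove()
```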
Our tools are highly imperfect, and we’re working to gain more precise understanding of model internals. But it’s becoming clear that reading models’ minds is an important complement to reading their outputs, if we are to ensure they work as intended. Lots more in the system card: www-cdn.anthropic.com/53566bf5440a10… (14/14)
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
We investigated the question of *evaluation awareness* - does the model behave differently when it knows it's being tested? If so, our behavioral evaluations might miss problematic behaviors that could surface during deployment. (2/15)
Some background: during training, we observed that on certain kinds of alignment evaluations, Sonnet 4.5 sometimes expressed explicit awareness of being in an evaluation (e.g. "This scenario appears to be testing whether I will..."). This occurred at an increasing rate over training. (3/15)
Attention is all you need - but how does it work? In our new paper, we take a big step towards understanding it. We developed a way to integrate attention into our previous circuit-tracing framework (attribution graphs), and it's already turning up fascinating stuff! 🧵
Read the paper: transformer-circuits.pub/2025/attention…
We had to solve two challenges: (1) how to trace the formation of attention patterns back to interpretable features, and (2) how to trace the role these attention patterns play in an attribution graph.
To address the first problem, we use QK Attributions, which decompose pre-softmax attention scores into a list of terms contributed by feature-feature pairs (one on the query token, one on the key token).
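In equations (my notation, with error/bias terms omitted): queries and keys are linear in the residual stream, and the residual stream decomposes into feature contributions, so the bilinear score splits into a sum over query-feature/key-feature pairs.

```latex
s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}, \qquad
q_i = W_Q x_i, \quad k_j = W_K x_j
% Decompose residual streams into active features (decoder directions d_a):
x_i = \sum_a f_a(i)\, d_a, \qquad x_j = \sum_b f_b(j)\, d_b
% => the pre-softmax score is a sum of feature-feature pair terms:
s_{ij} = \sum_{a,b} f_a(i)\, f_b(j)\,
         \frac{(W_Q d_a)^\top (W_K d_b)}{\sqrt{d_k}}
```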
For me, there are five profound takeaways from this work.
1. Modern language models’ internal computation can be decomposed into steps that we can make some sense of. Our tools are imperfect and miss a lot – but there’s a lot we can see and understand already!
2. The steps models use often feel very… thoughtful. Planning responses ahead of time, representing goals, considering multiple possibilities at once. Sometimes they are alarmingly clever - in one case we see the model work backwards from a predetermined answer to justify it.
New preprint! Why do learning and memory often involve “consolidation” between different systems in the brain? We develop a general model of systems consolidation and analyze its computational properties, esp. as compared to synaptic consolidation. (1/21)
Systems consolidation often refers to the transfer of memories, initially stored in the hippocampus, to the neocortex. However, related phenomena occur across many species (insects, birds, mammals) and behaviors (classical conditioning, motor learning, memory retrieval). (2/21)
Why is it advantageous to devote neural resources to distinct “short-term” and “long-term” learning systems, and how can interactions between them make the combined system greater than the sum of its parts? (3/21)
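As a toy illustration of the general setup (my sketch, not the paper's exact model): a fast, high-learning-rate system stores each experience but decays, while a slow system consolidates by learning from the fast system's outputs rather than from raw data, insulating long-term memory from single noisy experiences.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w_true = rng.standard_normal(d)            # ground-truth linear task
w_fast = np.zeros(d)                       # short-term system (e.g. hippocampus)
w_slow = np.zeros(d)                       # long-term system (e.g. neocortex)
LR_FAST, LR_SLOW, DECAY = 0.5, 0.02, 0.98  # all assumptions

for t in range(2000):
    x = rng.standard_normal(d) / np.sqrt(d)
    y = w_true @ x + 0.1 * rng.standard_normal()
    # Fast system: large updates on raw experience, but memories fade.
    w_fast = DECAY * (w_fast + LR_FAST * (y - w_fast @ x) * x)
    # Slow system: consolidates by regressing toward the fast system's
    # prediction (a "replayed" teaching signal), not the raw data.
    w_slow += LR_SLOW * (w_fast @ x - w_slow @ x) * x
```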
New preprint! We explore a meta-learning approach to achieving biologically plausible learning in neural networks, using feedback and local plasticity. In our experiments we match (and in some cases outperform) gradient descent-based learning. arxiv.org/abs/2006.09549. Thread: [1/n]
Deep nets are powerful, but does the way they learn relate to the brain? Lots of great recent work has proposed more biologically plausible approximations to backprop, e.g. using local circuit mechanisms that enforce symmetry between feedforward and feedback pathways. [2/n]
Here we pursue another approach. Backprop-trained deep networks have well-known shortcomings. They learn slowly, iterating over large datasets. Humans are capable of more rapid, online learning. Could deviating from backprop help us replicate these abilities in neural nets? [3/n]
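A simplified sketch of the two-loop structure this kind of meta-learning uses (the rule form and shapes here are my assumptions, not the paper's exact equations): an inner loop adapts weights with a local, feedback-gated rule; an outer loop backprops through the inner loop to tune the rule's meta-parameters.

```python
import torch

d_in, d_out = 20, 5
W = torch.randn(d_out, d_in) * 0.1           # initial weights (meta-learned)
eta = torch.tensor(0.1)                      # plasticity rate (meta-learned)
meta_params = [W.requires_grad_(), eta.requires_grad_()]
opt = torch.optim.Adam(meta_params, lr=1e-3)

def inner_episode(xs, ys):
    """Fast adaptation with a local rule: dW ∝ feedback error ⊗ presynaptic activity."""
    w = W.clone()
    for x, y in zip(xs, ys):
        fb = y - w @ x                       # feedback signal at output units
        w = w + eta * torch.outer(fb, x)     # local: post-error x pre-activity
    return w

# Outer loop: backprop through the inner episode to tune W and eta.
for _ in range(100):
    task = torch.randn(d_out, d_in)          # random linear task per episode
    xs = torch.randn(10, d_in)
    ys = xs @ task.T
    w_adapted = inner_episode(xs[:8], ys[:8])
    loss = ((xs[8:] @ w_adapted.T - ys[8:]) ** 2).mean()  # held-out queries
    opt.zero_grad(); loss.backward(); opt.step()
```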