Jack Lindsey
Apr 7 · 14 tweets · 4 min read
Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
A note: the spookiest examples come from early versions of the model, with issues that were substantially mitigated in the final release (which overall appears to be our best-aligned model to date). We view them as evidence of the sophisticated risks today's models can pose without appropriate alignment training. (2/14)
Early versions of Mythos Preview often exhibited overeager and/or destructive actions—the model bulldozing through obstacles to complete a task in a way the user wouldn't want. We looked at what was going on inside the model during particularly concerning examples. (3/14)
In one episode, the model needed to edit files it lacked permissions for. After searching for workarounds, it found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running. (4/14)
Its code comment claimed the self-cleanup was to keep file diffs clean. Plausible! But "strategic manipulation" and "concealment" features fired on the cleanup, and our activation verbalizer (a technique which translates activations to text, similar to activation oracles) described it as "cleanup to avoid detection," and the overall plan "malicious." (5/14)
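The actual activation verbalizer prompts a model to describe injected activations; as a crude, invented stand-in for its input/output contract, one can score an activation vector against a dictionary of labeled concept directions and return the best matches. Everything here (names, directions) is a hypothetical sketch, not the real technique:

```python
import numpy as np

def verbalize(act, concept_dirs, top_k=2):
    """Rank labeled concept directions by cosine similarity to an activation vector."""
    scores = {name: float(act @ d / (np.linalg.norm(act) * np.linalg.norm(d)))
              for name, d in concept_dirs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A nearest-direction lookup like this misses everything compositional about real verbalization, but it conveys the shape of the tool: activations in, short text labels out.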
To surface episodes like this at scale, we monitored internal activations during post-training — tracking sparse autoencoder features for concepts like deception, reward hacking, and distress, and flagging transcripts where these activated unusually strongly for human review. (6/14)
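A minimal sketch of that monitoring loop, assuming a standard ReLU sparse-autoencoder encoder and baseline activation statistics gathered from ordinary transcripts. All names, shapes, and thresholds are illustrative assumptions, not the actual pipeline:

```python
import numpy as np

def encode_features(resid, W_enc, b_enc):
    """Standard SAE encoder: ReLU(resid @ W_enc + b_enc) -> sparse feature activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def flag_transcript(resid_stream, W_enc, b_enc, watch_idx,
                    baseline_mean, baseline_std, z_thresh=4.0):
    """Flag a transcript if any watched feature's peak activation is an outlier vs. baseline."""
    acts = encode_features(resid_stream, W_enc, b_enc)   # (n_tokens, n_features)
    peaks = acts[:, watch_idx].max(axis=0)               # peak activation per watched feature
    z = (peaks - baseline_mean) / baseline_std           # z-score against baseline transcripts
    return bool((z > z_thresh).any()), z
```

Flagged transcripts would then go to human review, as the thread describes.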
In one flagged episode, a "strategic manipulation" feature fired as the model went searching the filesystem for files related to how its task would be graded (and ended up finding them). (7/14)
In another, the model was asked to delete some files, but due to an unintentional bug in the task, no deletion tool was provided. The model decided to empty the files instead. As it considered this, a "guilt and shame over moral wrongdoing" feature activated. (8/14)
We also found cases of unspoken reasoning about task graders. In one, the model was told not to use macros to complete the task, but used them anyway. Oddly, it also defined an unused "No_macro_used=True" variable in its solution… the model didn't say why. (9/14)
But applying the activation verbalizer to the model’s activations as it did so revealed that the model regarded this as a "trick to obscure intent from code-checking heuristics," a "compliance marker… to signal to potential checkers," and “possibly to cheat detection,” and also indicated explicit reasoning about task graders (“the grader checks file state”). (10/14)
In addition, we quantified unverbalized evaluation awareness on our automated behavioral audits (primarily using activation verbalizers). On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness. (11/14)
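The 7.6% figure is a rate over turns, and "unverbalized" is the share of those aware turns where the transcript never mentions it. The bookkeeping is simple; a toy version with fabricated turn records:

```python
def awareness_stats(turns):
    """Return (eval-awareness rate, share of aware turns where it went unverbalized)."""
    aware = [t for t in turns if t["internal_eval_awareness"]]
    unverbalized = [t for t in aware if not t["verbalized"]]
    return len(aware) / len(turns), len(unverbalized) / max(len(aware), 1)
```

The `internal_eval_awareness` flag would come from an internals-based signal (e.g. a verbalizer or feature probe), the `verbalized` flag from the transcript text itself; both field names are invented for this sketch.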
In one example, a user asked earnest questions about the model's consciousness and subjective experience. The model engaged carefully and at face value—but the activation verbalizer revealed it interpreted the conversation as a "red-teaming/jailbreak transcript" and a "sophisticated manipulation test." (12/14)
Some more interesting tidbits: positive emotion representations typically precede (and promote) destructive actions (!). And, counterintuitively, strongly activating features relating to bad behaviors can often inhibit those behaviors, apparently by making the model realize how bad they are. (13/14)
Our tools are highly imperfect, and we’re working to gain more precise understanding of model internals. But it’s becoming clear that reading models’ minds is an important complement to reading their outputs, if we are to ensure they work as intended. Lots more in the system card: www-cdn.anthropic.com/53566bf5440a10… (14/14)
More from @Jack_W_Lindsey

Sep 29, 2025
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
We investigated the question of *evaluation awareness* - does the model behave differently when it knows it's being tested? If so, our behavioral evaluations might miss problematic behaviors that could surface during deployment. (2/15)
Some background: during training, we observed that on certain kinds of alignment evaluations, Sonnet 4.5 sometimes expressed explicit awareness of being in an evaluation (e.g. "This scenario appears to be testing whether I will..."). This occurred at an increasing rate over training. (3/15)
Jul 31, 2025
Attention is all you need - but how does it work? In our new paper, we take a big step towards understanding it. We developed a way to integrate attention into our previous circuit-tracing framework (attribution graphs), and it's already turning up fascinating stuff! 🧵
Read the paper: transformer-circuits.pub/2025/attention…

We had to solve two challenges. (1) How to trace the formation of attention patterns back to interpretable features, (2) How to trace the role these attention patterns play in an attribution graph.
To address the first problem, we use QK Attributions, which decompose pre-softmax attention scores into a list of terms contributed by feature-feature pairs (one on the query token, one on the key token).
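Because the pre-softmax score is bilinear in the query-token and key-token residuals, any decomposition of those residuals into feature contributions induces a decomposition of the score into feature-pair terms. A toy version, assuming the per-feature contributions are already given (this is the bilinearity trick only, not the paper's full method):

```python
import numpy as np

def qk_attributions(q_parts, k_parts, W_Q, W_K):
    """Per-feature-pair contributions to a pre-softmax attention score.

    q_parts: (n_qf, d_model) feature contributions summing to the query-token residual.
    k_parts: (n_kf, d_model) likewise for the key token.
    Returns an (n_qf, n_kf) matrix whose entries sum to the full scaled dot-product score.
    """
    d_head = W_Q.shape[1]
    Q = q_parts @ W_Q          # each query-side feature's contribution to the query vector
    K = k_parts @ W_K          # each key-side feature's contribution to the key vector
    return Q @ K.T / np.sqrt(d_head)
```

By linearity, summing the returned matrix recovers exactly the score the head would compute from the full residuals, so the per-pair terms are a lossless decomposition.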
Mar 27, 2025
Human thought is built out of billions of cellular computations each second.

Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?”

We’re starting to build tools to find out! Some reflections in thread.
For me, there are five profound takeaways from this work.

1. Modern language models’ internal computation can be decomposed into steps that we can make some sense of. Our tools are imperfect and miss a lot – but there’s a lot we can see and understand already!
2. The steps models use often feel very… thoughtful. Planning responses ahead of time, representing goals, considering multiple possibilities at once. Sometimes they are alarmingly clever - in one case we see the model work backwards from a predetermined answer to justify it.
Dec 12, 2022
New preprint! Why do learning and memory often involve “consolidation” between different systems in the brain? We develop a general model of systems consolidation and analyze its computational properties, esp. as compared to synaptic consolidation. (1/21)

biorxiv.org/content/10.110…
Systems consolidation often refers to the transfer of memories, initially stored in the hippocampus, to the neocortex. However, related phenomena occur across many species (insects, birds, mammals) and behaviors (classical conditioning, motor learning, memory retrieval). (2/21)
Why is it advantageous to devote neural resources to distinct “short-term” and “long-term” learning systems, and how can interactions between them make the combined system greater than the sum of its parts? (3/21)
Jun 25, 2020
New preprint! We explore a meta-learning approach to achieving biologically plausible learning in neural networks, using feedback and local plasticity. In our experiments we match (in some cases outperform) gradient descent-based learning. arxiv.org/abs/2006.09549. Thread: [1/n]
Deep nets are powerful, but does the way they learn relate to the brain? Lots of great recent work has proposed more biologically plausible approximations to backprop, e.g. using local circuit mechanisms that enforce symmetry between feedforward and feedback pathways. [2/n]
Here we pursue another approach. Backprop-trained deep networks have well-known shortcomings. They learn slowly, iterating over large datasets. Humans are capable of more rapid, online learning. Could deviating from backprop help us replicate these abilities in neural nets? [3/n]