Latest Twitter Threads by @Jack_W_Lindsey on Thread Reader App

Apr 7 • 14 tweets • 4 min read

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)

A note: the spookiest examples come from early versions of the model, with issues that were substantially mitigated in the final release (which overall appears to be our best-aligned model to date). We view them as evidence of the sophisticated risks today's models can pose without appropriate alignment training. (2/14)

Sep 29, 2025 • 15 tweets • 4 min read

Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)

We investigated the question of *evaluation awareness* - does the model behave differently when it knows it's being tested? If so, our behavioral evaluations might miss problematic behaviors that could surface during deployment. (2/15)

Jul 31, 2025 • 10 tweets • 3 min read

Attention is all you need - but how does it work? In our new paper, we take a big step towards understanding it. We developed a way to integrate attention into our previous circuit-tracing framework (attribution graphs), and it's already turning up fascinating stuff! 🧵 Read the paper:

We had to solve two challenges. (1) How to trace the formation of attention patterns back to interpretable features, (2) How to trace the role these attention patterns play in an attribution graph. transformer-circuits.pub/2025/attention…

Mar 27, 2025 • 8 tweets • 2 min read

Human thought is built out of billions of cellular computations each second.

Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?”

We’re starting to build tools to find out! Some reflections in thread.

https://twitter.com/AnthropicAI/status/1905303835892990278

For me, there are five profound takeaways from this work.

1. Modern language models’ internal computation can be decomposed into steps that we can make some sense of. Our tools are imperfect and miss a lot – but there’s a lot we can see and understand already!

Dec 12, 2022 • 19 tweets • 5 min read

New preprint! Why do learning and memory often involve “consolidation” between different systems in the brain? We develop a general model of systems consolidation and analyze its computational properties, esp. as compared to synaptic consolidation. (1/21)

biorxiv.org/content/10.110…

Systems consolidation often refers to the transfer of memories, initially stored in the hippocampus, to the neocortex. However, related phenomena occur across many species (insects, birds, mammals) and behaviors (classical conditioning, motor learning, memory retrieval). (2/21)

Jun 25, 2020 • 16 tweets • 5 min read

New preprint! We explore a meta-learning approach to achieving biologically plausible learning in neural networks, using feedback and local plasticity. In our experiments we match (in some cases outperform) gradient descent-based learning. arxiv.org/abs/2006.09549. Thread: [1/n]

Deep nets are powerful, but does the way they learn relate to the brain? Lots of great recent work has proposed more biologically plausible approximations to backprop, e.g. using local circuit mechanisms that enforce symmetry between feedforward and feedback pathways. [2/n]
