Anthropic
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97tMeF.

Mar 27, 10 tweets

New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…

We describe ten case studies that each illustrate an aspect of "AI biology".

One of them shows how Claude, even as it says words one at a time, in some cases plans further ahead.

How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.

Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?

We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".
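Purely as a toy analogy (an illustration we've assumed here, not Claude's learned circuitry), the finding can be pictured as two estimates computed side by side, one resolving only the rough magnitude and one resolving only the last digit, which are then combined:

```python
# Toy analogy only: NOT Claude's actual mechanism, just a picture of how an
# answer can be assembled from two parallel estimates rather than looked up
# in a memorized table. In the model, the analogous paths are learned,
# approximate features; here each helper trivially sees the exact sum.

def magnitude_path(a: int, b: int) -> int:
    """Rough path: the sum resolved only down to the tens place."""
    return (a + b) // 10 * 10      # e.g. 36 + 59 -> 90

def last_digit_path(a: int, b: int) -> int:
    """Precise path: only the final digit of the sum (modular arithmetic)."""
    return (a + b) % 10            # e.g. 36 + 59 -> 5

def mental_addition(a: int, b: int) -> int:
    """Combine the two independent estimates into the final answer."""
    return magnitude_path(a, b) + last_digit_path(a, b)

assert mental_addition(36, 59) == 95
```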

We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.

That feature can sometimes activate in error, causing a hallucination.
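To make the gating idea concrete, here is a minimal sketch with invented names; it illustrates the described logic (default refusal suppressed by a "known answer" signal) and its failure mode, and is not Anthropic's interpretability code or Claude's actual circuit:

```python
# Minimal sketch, invented names: default behaviour is refusal, and a
# "known answer" feature suppresses that refusal. If the feature fires
# without a real answer behind it, the result is a hallucination.

def respond(question: str,
            known_answer_feature_active: bool,
            recalled_answer: str | None) -> str:
    if not known_answer_feature_active:
        return "I'm not sure; I don't want to guess."  # default refusal wins

    if recalled_answer is not None:
        return recalled_answer                         # refusal correctly suppressed

    # Failure mode: the feature misfired (e.g. a name merely looks familiar),
    # no genuine answer is retrieved, and the model confabulates one.
    return "a plausible-sounding but fabricated answer"
```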

In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than genuinely trying to solve the problem, the model works backwards, making up plausible intermediate steps that let it end up at the hinted answer.

Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.

Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.

For more, read our papers:

On the Biology of a Large Language Model contains an interactive explanation of each case study: transformer-circuits.pub/2025/attributi…

Circuit Tracing explains our technical approach in more depth: transformer-circuits.pub/2025/attributi…

We're recruiting researchers to work with us on AI interpretability. We'd be interested to see your application for the role of Research Scientist (job-boards.greenhouse.io/anthropic/jobs…) or Research Engineer (job-boards.greenhouse.io/anthropic/jobs…).
