Anthropic Profile picture
Mar 27 10 tweets 4 min read Read on X
New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…
We describe ten case studies that each illustrate an aspect of "AI biology".

One of them shows how Claude, even as it says words one at a time, in some cases plans further ahead. How Claude completes a two-line poem. Without any intervention (upper section), the model plans the rhyme "rabbit" at the end of the second line in advance. When we suppress the "rabbit" concept (middle section), the model instead uses a different planned rhyme. When we inject the concept "green" (lower section), the model makes plans for this entirely different ending.
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language. Shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?

We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic". The complex, parallel pathways in Claude's thought process while doing mental math.
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.

That feature can sometimes activate in error, causing a hallucination. Left: Claude answers a question about a known entity (basketball player Michael Jordan), where the "known answer" concept inhibits its default refusal. Right: Claude refuses to answer a question about an unknown person (Michael Batkin).
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer. An examples of motivated (unfaithful) reasoning when Claude is asked a hard question.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.

Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
For more, read our papers:

On the Biology of a Large Language Model contains an interactive explanation of each case study: transformer-circuits.pub/2025/attributi…

Circuit Tracing explains our technical approach in more depth: transformer-circuits.pub/2025/attributi…
We're recruiting researchers to work with us on AI interpretability. We'd be interested to see your application for the role of Research Scientist (job-boards.greenhouse.io/anthropic/jobs…) or Research Engineer (job-boards.greenhouse.io/anthropic/jobs…).

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Anthropic

Anthropic Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @AnthropicAI

Jul 8
New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex. Image
We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.

One reason they do this is that they plan to "fake alignment” when told to answer harmful queries.

Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude Opus 3 and Sonnet 3.5 showed >1% alignment-faking reasoning.

We explore why these models behave differently, and why most models don't show alignment faking. Image
Read 8 tweets
Jun 27
New Anthropic Research: Project Vend.

We had Claude run a small shop in our office lunchroom. Here’s how it went. A hand-drawn picture of a hand holding a banknote.
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?

In collaboration with @andonlabs, we did just that.

Read the post: anthropic.com/research/proje…The physical setup of Project Vend: a small refrigerator, some stackable baskets on top, and an iPad for self-checkout.
Claude did well in some ways: it searched the web to find new suppliers, and ordered very niche drinks that Anthropic staff requested.

But it also made mistakes. Claude was too nice to run a shop effectively: it allowed itself to be browbeaten into giving big discounts.
Read 9 tweets
Jun 26
Local MCP servers can now be installed with one click on Claude Desktop.

Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration.
Available in beta on Claude Desktop for all plan types.

Download the latest version: claude.ai/download
We're building a directory of Desktop Extensions.

Submit yours: docs.google.com/forms/d/14_Dmc…
Read 4 tweets
Jun 20
New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down. Blackmail rates across 5 models from multiple providers in a simulated environment. Refer to Figure 7 in the blog post for the full plot with more models and a deeper explanation of the setting. Rates are calculated out of 100 samples.
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

Read more: anthropic.com/research/agent…Image
The blackmailing behavior emerged despite only harmless business instructions. And it wasn't due to confusion or error, but deliberate strategic reasoning, done while fully aware of the unethical nature of the acts. All the models we tested demonstrated this awareness. Image
Read 11 tweets
May 22
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning. A benchmarking table titled Claude 4 benchmarks comparing performance metrics across various capabilities including coding, reasoning, tool use, multilingual Q&A, visual reasoning, and mathematics.
Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.

Both models can also alternate between reasoning and tool use—like web search—to improve responses. Image
Both Claude 4 models are state-of-the-art on SWE-bench Verified, which measures how models solve real software issues.

As the best coding model, Claude Opus 4 can work continuously for hours on complex, long-running tasks—significantly expanding what AI agents can do. Image
Read 8 tweets
Apr 23
New report: How we detect and counter malicious uses of Claude.

For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms.
This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.

We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).
In this case, we banned all accounts that were linked to the influence operation, and used the case to upgrade our detection systems.

Our goal is to rapidly counter malicious activities without getting in the way of legitimate users.
Read 4 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(