Anthropic
Mar 27 · 10 tweets
New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…
We describe ten case studies that each illustrate an aspect of "AI biology".

One of them shows how Claude, even as it says words one at a time, in some cases plans further ahead.

Image: How Claude completes a two-line poem. Without any intervention (upper section), the model plans the rhyme "rabbit" at the end of the second line in advance. When we suppress the "rabbit" concept (middle section), the model instead uses a different planned rhyme. When we inject the concept "green" (lower section), the model makes plans for this entirely different ending.
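Suppressing or injecting a concept here means editing the model's activations mid-computation rather than changing the prompt. As a rough, hypothetical sketch of that style of intervention (a toy network and a random unit vector stand in for Claude and a learned "rabbit" feature; this is not Anthropic's actual tooling):

```python
# Minimal sketch of concept suppression/injection via an activation hook.
# Hypothetical: a tiny feed-forward network stands in for Claude, and a
# random unit vector stands in for a learned "rabbit" feature direction.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 64
model = nn.Sequential(
    nn.Linear(32, hidden_dim),   # "earlier layers"
    nn.ReLU(),                   # hidden activations we will edit
    nn.Linear(hidden_dim, 10),   # "later layers" / output head
)

# Pretend this direction in the hidden space encodes the concept "rabbit".
rabbit_direction = torch.randn(hidden_dim)
rabbit_direction /= rabbit_direction.norm()

def make_steering_hook(direction, strength):
    """Add (inject, strength > 0) or subtract (suppress, strength < 0)
    a concept direction from a layer's output activations."""
    def hook(module, inputs, output):
        return output + strength * direction
    return hook

x = torch.randn(1, 32)
baseline = model(x)

# Suppress the "rabbit" concept at the hidden layer and re-run.
handle = model[1].register_forward_hook(make_steering_hook(rabbit_direction, -4.0))
suppressed = model(x)
handle.remove()

print("output change from suppression:", (suppressed - baseline).norm().item())
```

Injecting a different concept ("green") would be the same hook with a positive strength and a different direction.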
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.

Image: Shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?

We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".

Image: The complex, parallel pathways in Claude's thought process while doing mental math.
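The accompanying blog post describes one pathway that roughly approximates the sum while another pins down its final digit, with the two reconciled at the end. The toy function below is only a schematic of that kind of decomposition (the helper names and exact rounding are invented for illustration), not the model's circuit:

```python
# Toy schematic of two "parallel paths" that combine into one sum,
# for non-negative integers: one path gets the rough magnitude, the
# other gets the exact last digit, and reconciling them gives the answer.
def rough_magnitude(a: int, b: int) -> int:
    """Approximate-path stand-in: the sum rounded to the nearest ten
    (in the real model this path is learned and genuinely fuzzy)."""
    return ((a + b + 5) // 10) * 10

def last_digit(a: int, b: int) -> int:
    """Exact-path stand-in: only the ones digit of the sum."""
    return (a % 10 + b % 10) % 10

def combine(a: int, b: int) -> int:
    """Exactly one integer within +/-5 of the rough estimate ends in
    the right digit; that integer is the sum."""
    approx = rough_magnitude(a, b)
    ones = last_digit(a, b)
    for candidate in range(approx - 5, approx + 5):
        if candidate % 10 == ones:
            return candidate

assert combine(36, 59) == 95
assert combine(47, 38) == 85
```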
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.

That feature can sometimes activate in error, causing a hallucination.

Image: Left: Claude answers a question about a known entity (basketball player Michael Jordan), where the "known answer" concept inhibits its default refusal. Right: Claude refuses to answer a question about an unknown person (Michael Batkin).
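Schematically, the described circuit behaves like a gate: refusal is the default, a "known answer" signal has to inhibit it before the model answers, and a misfiring signal yields a confident answer about something the model doesn't actually know. A toy rendering of that logic (names and thresholds are illustrative, not read off the model):

```python
# Toy rendering of the "default refusal" gate described above.
def respond(name: str, known_answer_activation: float, threshold: float = 0.5) -> str:
    """Refusing is the default; answering happens only when the
    'known answer' feature is active enough to inhibit the refusal."""
    if known_answer_activation < threshold:
        return f"I'm not sure who {name} is."        # default: refuse
    return f"Here's what I know about {name}..."     # refusal inhibited: answer

print(respond("Michael Jordan", known_answer_activation=0.9))  # answers
print(respond("Michael Batkin", known_answer_activation=0.1))  # refuses
# If the feature misfires on an unknown name, the gate opens anyway:
print(respond("Michael Batkin", known_answer_activation=0.8))  # hallucination risk
```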
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer.

Image: An example of motivated (unfaithful) reasoning when Claude is asked a hard question.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.

Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
For more, read our papers:

On the Biology of a Large Language Model contains an interactive explanation of each case study: transformer-circuits.pub/2025/attributi…

Circuit Tracing explains our technical approach in more depth: transformer-circuits.pub/2025/attributi…
We're recruiting researchers to work with us on AI interpretability. We'd be interested to see your application for the role of Research Scientist (job-boards.greenhouse.io/anthropic/jobs…) or Research Engineer (job-boards.greenhouse.io/anthropic/jobs…).


More from @AnthropicAI

Mar 27
Last month we launched our Anthropic Economic Index, to help track the effect of AI on labor markets and the economy.

Today, we’re releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.
The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.

Read the blog post: anthropic.com/news/anthropic…

Image: In the two months since our original data sample, we’ve seen an increase in the share of usage for coding, education, and the sciences. The graph shows the share of Claude.ai Free and Pro traffic across top-level occupational categories in O*NET. Grey shows the distribution from our first report, covering data from Dec '24 - Jan '25. Colored bars show an increase (green) or decrease (blue) in the share of usage for our new data from Feb '25 - March '25. Note that the graph shows the share of usage rather than absolute usage.
We saw little change in the overall balance of “augmentation” versus “automation”, but some changes in the specific interaction modes within those categories.

For instance, there was a small increase in learning interactions, where users ask Claude for explanations.

Image: The balance of augmentation and automation has stayed relatively constant in the two months between our data samples (V1 and V2), though the share of Learning conversations has grown appreciably.
Mar 13
New Anthropic research: Auditing Language Models for Hidden Objectives.

We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Image: "Auditing Language Models for Hidden Objectives" by Marks et al.
We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.

Instead, we propose alignment audits to investigate models for hidden objectives.
To practice alignment audits, our Alignment Science and Interpretability teams ran a blind auditing game.

A red team trained—in secret—a model with a hidden objective, then gave it to four blue teams for investigation. Three teams won by uncovering the model’s hidden objective.

Image: A schematic diagram of the auditing game. The red team trains a model with a hidden objective and passes it to multiple blue teams to investigate. The blue teams write auditing reports, which the red team grades on whether they discovered the hidden objective. As part of the experiment, some blue teams have restricted model or data access.
Feb 25
A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem.

Can Claude play Pokémon?

A thread:
Early attempts were poor. In June 2024, Claude 3.5 Sonnet struggled to progress. When challenged, it repeatedly tried to run from mandatory battles.

This wasn't surprising: Claude has never been explicitly trained to play any video games.
These early attempts were not without moments of levity too.

On one occasion, Claude got stuck in a corner and—convinced something must be broken—typed out a formal request to reset the game.

Image: Claude's "thinking" output: a message titled "FORMAL REQUEST FOR ADMINISTRATIVE RESET", addressed to an "Administrator", requesting intervention to reset the game so the player can start properly from the bedroom and go on to defeat the Elite Four, signed "Respectfully submitted, Your AI Assistant."
Feb 24
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.

One model, two ways to think.

We’re also releasing an agentic coding tool: Claude Code.
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.

In addition, API users have precise control over how long the model can think for.
Claude 3.7 Sonnet is a state-of-the-art model for both coding and agentic tool use.

In developing it, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect the needs of our users.

Image: Performance of different AI models on the SWE-bench Verified benchmark. Claude 3.7 Sonnet significantly outperforms other models with 70.3% accuracy with a custom scaffold (62.3% base), while Claude 3.5 Sonnet, OpenAI o1, OpenAI o3-mini (high), and DeepSeek R1 all score around 49%.
Feb 10
Today we’re launching the Anthropic Economic Index, a new initiative aimed at understanding AI's impact on the economy over time.

The Index’s first paper analyzes millions of anonymized Claude conversations to reveal how AI is being used today in tasks across the economy.

Image: Title card: "Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations" by Handa & Tamkin et al.
Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.

Through the Anthropic Economic Index, we'll track how these patterns evolve as AI advances.

Read the blog post: anthropic.com/news/the-anthr…

Image: "AI usage by job type": a bar chart comparing the percentage of Claude conversations with the percentage of U.S. workers across 22 O*NET job categories. Computer and mathematical jobs show the highest Claude usage at 37.2%, while office and administrative support has the highest workforce share at 12.2%. Farming, fishing, and forestry show the lowest shares in both (0.3% and 0.1%); most other categories fall between 0% and 10% on both metrics.
Software and technical writing tasks were at the top; fishing and forestry had the lowest AI use.

Few jobs used AI across most of their tasks: only ~4% used AI for at least 75% of tasks.

Moderate use is more widespread: ~36% of jobs used AI for at least 25% of their tasks.

Image: Six job-category cards showing AI usage: Computer & Mathematical (37.2%), Arts & Media (10.3%), Education & Library (9.3%), Office & Administrative (7.9%), Life, Physical & Social Science (6.4%), and Business & Financial (5.9%), each listing its top job titles and top tasks.
Feb 3
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.

We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

Image: Title card for the paper "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming".
Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.

Our new technique is a step towards robust jailbreak defenses.

Read the blog post: anthropic.com/research/const…

Image: Simplified illustration of a user asking an LLM for harmful information and the LLM refusing, followed by the user using a jailbreak and the LLM complying.
Our algorithm trains LLM classification systems to block harmful inputs and outputs based on a "constitution" of harmful and harmless categories of information.

Image: A schematic diagram of the process of designing the Constitutional Classifiers system, from producing the constitution, to training the classifiers, to deploying them with Claude.
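At a systems level, the classifiers wrap the model: one screens the incoming prompt, another screens the completion, and both are trained on synthetic data generated from the constitution. A minimal sketch of that wrapper shape, with hypothetical stand-ins for the classifiers and the model (not the production system):

```python
# Minimal sketch of wrapping a model with input/output safety classifiers.
# `generate`, `input_classifier`, and `output_classifier` are hypothetical
# stand-ins; in the real system the classifiers are trained on synthetic
# data derived from a constitution of harmful/harmless categories.
from typing import Callable

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_classifier: Callable[[str], float],   # returns harm probability
    output_classifier: Callable[[str], float],  # returns harm probability
    threshold: float = 0.5,
) -> str:
    # Block clearly harmful requests before they reach the model.
    if input_classifier(prompt) > threshold:
        return "I can't help with that."
    completion = generate(prompt)
    # Block completions the output classifier flags as harmful.
    if output_classifier(completion) > threshold:
        return "I can't help with that."
    return completion

# Toy usage with trivial stand-ins:
print(guarded_generate(
    "How do I bake bread?",
    generate=lambda p: "Mix flour, water, salt, and yeast...",
    input_classifier=lambda text: 0.0,
    output_classifier=lambda text: 0.0,
))
```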
