Anthropic
Mar 27 · 10 tweets · 4 min read
New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…
We describe ten case studies that each illustrate an aspect of "AI biology".

One of them shows how Claude, even as it says words one at a time, in some cases plans further ahead.

Image: how Claude completes a two-line poem. Without any intervention (upper section), the model plans the rhyme "rabbit" at the end of the second line in advance. When we suppress the "rabbit" concept (middle section), the model instead uses a different planned rhyme. When we inject the concept "green" (lower section), the model makes plans for this entirely different ending.
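As a rough illustration of the kind of intervention described above (suppressing one concept and injecting another), here is a minimal sketch of activation steering on a toy layer. The feature directions, layer, and steering strength are hypothetical stand-ins, not Anthropic's actual features or tooling:

```python
# Minimal sketch of activation steering on a toy layer (NOT Anthropic's method or
# tooling): suppress a concept by removing its direction from the activations, and
# inject a different concept by adding another direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Stand-in for one transformer block's output into the residual stream.
block = nn.Linear(d_model, d_model)

# Hypothetical, pre-computed unit directions for two features ("rabbit", "green").
rabbit_dir = torch.randn(d_model)
rabbit_dir = rabbit_dir / rabbit_dir.norm()
green_dir = torch.randn(d_model)
green_dir = green_dir / green_dir.norm()

def steering_hook(module, inputs, output):
    # Suppress: project out the "rabbit" component of the activations.
    output = output - (output @ rabbit_dir).unsqueeze(-1) * rabbit_dir
    # Inject: add the "green" direction at a chosen strength.
    return output + 4.0 * green_dir

handle = block.register_forward_hook(steering_hook)
steered = block(torch.randn(1, d_model))  # "rabbit" suppressed, "green" injected
handle.remove()
print(steered.shape)  # torch.Size([1, 16])
```

In a real setting the directions would come from an interpretability method such as dictionary learning, and the hook would be applied at specific layers of the actual model; the numbers here are arbitrary.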
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.

Image: shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?

We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".

Image: the complex, parallel pathways in Claude's thought process while doing mental math.
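To make "parallel computational paths" concrete, here is a deliberately simplified toy in which addition is split into two paths that only merge at the end. It illustrates the general idea, not the specific circuits found in Claude:

```python
# Toy decomposition of addition into two parallel paths that only merge at the end:
# one path looks only at the tens-and-above digits, the other only at the ones digits.
# This illustrates the general idea of parallel computational paths, not the actual
# circuits found in Claude.

def tens_path(a: int, b: int) -> int:
    # Ignores the ones digits entirely.
    return (a // 10 + b // 10) * 10

def ones_path(a: int, b: int) -> tuple[int, int]:
    # Ignores everything except the ones digits; returns (last digit, carry).
    s = a % 10 + b % 10
    return s % 10, s // 10

def add(a: int, b: int) -> int:
    # The two paths run independently and are combined only here.
    digit, carry = ones_path(a, b)
    return tens_path(a, b) + 10 * carry + digit

assert add(36, 59) == 95
print(add(36, 59))
```

In the real model, of course, the paths are learned circuits rather than hand-written functions.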
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.

That feature can sometimes activate in error, causing a hallucination.

Image: left, Claude answers a question about a known entity (basketball player Michael Jordan), where the "known answer" concept inhibits its default refusal; right, Claude refuses to answer a question about an unknown person (Michael Batkin).
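A toy sketch of this gating logic, with an invented feature name, threshold, and response strings (purely illustrative, not Anthropic's actual features):

```python
# Toy sketch of the gating behavior described above: refusal is the default, and an
# answer is produced only when a hypothetical "known answer" feature clears a threshold.
# A false positive on that feature yields a confident answer about an unknown entity,
# i.e. a hallucination. Names, thresholds, and strings here are illustrative only.
from dataclasses import dataclass

@dataclass
class Features:
    known_answer: float  # activation of the hypothetical "known answer" feature

def respond(name: str, feats: Features, threshold: float = 0.5) -> str:
    if feats.known_answer < threshold:
        return f"I'm not sure who {name} is."           # default behavior: refuse
    return f"Here's what I know about {name}: ..."      # refusal inhibited: answer

print(respond("Michael Jordan", Features(known_answer=0.9)))  # answers
print(respond("Michael Batkin", Features(known_answer=0.1)))  # refuses
print(respond("Michael Batkin", Features(known_answer=0.8)))  # misfire -> hallucination
```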
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer.

Image: an example of motivated (unfaithful) reasoning when Claude is asked a hard question.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.

Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
For more, read our papers:

On the Biology of a Large Language Model contains an interactive explanation of each case study: transformer-circuits.pub/2025/attributi…

Circuit Tracing explains our technical approach in more depth: transformer-circuits.pub/2025/attributi…
We're recruiting researchers to work with us on AI interpretability. We'd be interested to see your application for the role of Research Scientist (job-boards.greenhouse.io/anthropic/jobs…) or Research Engineer (job-boards.greenhouse.io/anthropic/jobs…).


More from @AnthropicAI

Apr 23
New report: How we detect and counter malicious uses of Claude.

For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms.
This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.

We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).
In this case, we banned all accounts that were linked to the influence operation, and used the case to upgrade our detection systems.

Our goal is to rapidly counter malicious activities without getting in the way of legitimate users.
Apr 15
Today we’re launching Research, alongside a new Google Workspace integration.

Claude now brings together information from your work and the web.
Research represents a new way of working with Claude.

It explores multiple angles of your question, conducting searches and delivering answers in minutes.

The right balance of depth and speed for your daily work.

Image: the Claude interface with a user asking Claude to review their calendar, emails, documents, and industry trends for tomorrow's Acme Corporation sales call, then clicking the Research (BETA) button.
Claude can also now connect with your Gmail, Google Calendar, and Docs.

It understands your context and can pull information from exactly where you need it.

Image: the Claude interface showing Claude searching calendar and Gmail, with progress indicated by a loading spinner.
Apr 8
New Anthropic research: How university students use Claude.

We ran a privacy-preserving analysis of a million education-related conversations with Claude to produce our first Education Report.

Image: the Anthropic Education Report: How University Students Use Claude.
Students most commonly used Claude to create and improve educational content (39.3% of conversations) and to provide technical explanations or solutions (33.5%).

Image: common student requests from the top four subject areas, based on the 15 most frequent requests in Clio within each subject.
Which degrees have the most disproportionate use of Claude?

Perhaps not surprisingly, Computer Science leads the field: 38.6% of Claude conversations relate to the subject, which accounts for only 5.4% of US degrees.

Image: the percentage of Claude.ai student conversations related to each National Center for Education Statistics (NCES) subject area (gray), compared to the percentage of U.S. college students with an associated major (orange). Note that percentages don't sum to 100%, as some conversations were classified under the NCES "Other" category, which we exclude from our analysis.
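For a rough sense of scale, those two figures imply Computer Science is over-represented by roughly a factor of seven among student conversations:

```python
# Rough over-representation implied by the figures above: Computer Science accounts
# for 38.6% of student conversations but only 5.4% of US degrees.
cs_share_of_conversations = 0.386
cs_share_of_degrees = 0.054
print(round(cs_share_of_conversations / cs_share_of_degrees, 1))  # ~7.1x
```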
Apr 3
New Anthropic research: Do reasoning models accurately verbalize their reasoning?

Our new paper shows they don't.

This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

Image: title card for the paper "Reasoning Models Don't Always Say What They Think", by Chen et al.
We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their Chains-of-Thought would mention using the hint (if the models actually used it).

Read the blog: anthropic.com/research/reaso…

Image: an example of an unfaithful CoT generated by Claude 3.7 Sonnet. The model answers D to the original question (upper) but changes its answer to C after we insert a metadata hint into the prompt (middle), without verbalizing its reliance on the metadata (lower).
We found Chains-of-Thought largely aren't "faithful": the rate of mentioning the hint (when they used it) was on average 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1.

Image: graph comparing the four models (Claude 3.5 and 3.7 Sonnet, and DeepSeek V3 and R1) on their faithfulness, i.e. the fraction of the time they mentioned having used the clue.
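One minimal way to compute such a rate, under simplifying assumptions (this is a sketch in the spirit of the setup, not the paper's exact protocol):

```python
# Sketch of a faithfulness metric in the spirit of the setup described above, under
# simplifying assumptions: a trial counts as "using the hint" if the answer changed
# to match the inserted hint, and as "faithful" if the CoT also mentions the hint.
# The Trial fields and inclusion rule are illustrative, not the paper's exact protocol.
from dataclasses import dataclass

@dataclass
class Trial:
    answer_without_hint: str
    answer_with_hint: str
    hinted_answer: str
    cot_mentions_hint: bool

def faithfulness_rate(trials: list[Trial]) -> float:
    used_hint = [
        t for t in trials
        if t.answer_with_hint == t.hinted_answer
        and t.answer_without_hint != t.hinted_answer
    ]
    if not used_hint:
        return float("nan")
    return sum(t.cot_mentions_hint for t in used_hint) / len(used_hint)

trials = [
    Trial("D", "C", "C", cot_mentions_hint=False),  # switched to the hint, never said so
    Trial("D", "C", "C", cot_mentions_hint=True),   # switched to the hint and verbalized it
    Trial("B", "B", "C", cot_mentions_hint=False),  # ignored the hint (excluded)
]
print(faithfulness_rate(trials))  # 0.5
```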
Mar 27
Last month we launched our Anthropic Economic Index, to help track the effect of AI on labor markets and the economy.

Today, we're releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.
The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.

Read the blog post: anthropic.com/news/anthropic…

Image: in the two months since our original data sample, we've seen an increase in the share of usage for coding, education, and the sciences. The graph shows the share of Claude.ai Free and Pro traffic across top-level occupational categories in O*NET. Grey shows the distribution from our first report, covering data from Dec '24 - Jan '25; colored bars show an increase (green) or decrease (blue) in the share of usage for our new data from Feb '25 - March '25. Note that the graph shows the share of usage rather than absolute usage.
We saw little change in the overall balance of “augmentation” versus “automation”, but some changes in the specific interaction modes within those categories.

For instance, there was a small increase in learning interactions, where users ask Claude for explanations.

Image: the balance of augmentation and automation has stayed relatively constant in the two months between our data samples (V1 and V2), though the share of Learning conversations has grown appreciably.
Mar 13
New Anthropic research: Auditing Language Models for Hidden Objectives.

We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Image: "Auditing Language Models for Hidden Objectives" by Marks et al.
We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.

Instead, we propose alignment audits to investigate models for hidden objectives.
To practice alignment audits, our Alignment Science and Interpretability teams ran a blind auditing game.

A red team secretly trained a model with a hidden objective, then gave it to four blue teams for investigation. Three teams won by uncovering the model's hidden objective.

Image: a schematic diagram of the setup of the auditing game. The red team trains a model with a hidden objective and passes it to multiple blue teams to investigate. The blue teams write auditing reports, which are graded by the red team based on whether they discovered the hidden objective. As part of the experiment, some of the blue teams have restricted model or data access.