Anthropic
Mar 27 · 10 tweets
New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…
We describe ten case studies that each illustrate an aspect of "AI biology".

One of them shows how Claude, even as it says words one at a time, in some cases plans further ahead.

Image: How Claude completes a two-line poem. Without any intervention (upper section), the model plans the rhyme "rabbit" at the end of the second line in advance. When we suppress the "rabbit" concept (middle section), the model instead uses a different planned rhyme. When we inject the concept "green" (lower section), the model makes plans for this entirely different ending.
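Suppressing or injecting a concept here means editing the model's activations mid-computation rather than changing the prompt. As a rough, hypothetical sketch of that style of intervention (a toy network and a random unit vector stand in for Claude and a learned "rabbit" feature; this is not Anthropic's actual tooling):

```python
# Minimal sketch of concept suppression/injection via an activation hook.
# Hypothetical: a tiny feed-forward network stands in for Claude, and a
# random unit vector stands in for a learned "rabbit" feature direction.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 64
model = nn.Sequential(
    nn.Linear(32, hidden_dim),   # "earlier layers"
    nn.ReLU(),                   # hidden activations we will edit
    nn.Linear(hidden_dim, 10),   # "later layers" / output head
)

# Pretend this direction in the hidden space encodes the concept "rabbit".
rabbit_direction = torch.randn(hidden_dim)
rabbit_direction /= rabbit_direction.norm()

def make_steering_hook(direction, strength):
    """Add (inject, strength > 0) or subtract (suppress, strength < 0)
    a concept direction from a layer's output activations."""
    def hook(module, inputs, output):
        return output + strength * direction
    return hook

x = torch.randn(1, 32)
baseline = model(x)

# Suppress the "rabbit" concept at the hidden layer and re-run.
handle = model[1].register_forward_hook(make_steering_hook(rabbit_direction, -4.0))
suppressed = model(x)
handle.remove()

print("output change from suppression:", (suppressed - baseline).norm().item())
```

Injecting a different concept ("green") would be the same hook with a positive strength and a different direction.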
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.

Image: Shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?

We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".

Image: The complex, parallel pathways in Claude's thought process while doing mental math.
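The accompanying blog post describes one pathway that roughly approximates the sum while another pins down its final digit, with the two reconciled at the end. The toy function below is only a schematic of that kind of decomposition (the helper names and exact rounding are invented for illustration), not the model's circuit:

```python
# Toy schematic of two "parallel paths" that combine into one sum,
# for non-negative integers: one path gets the rough magnitude, the
# other gets the exact last digit, and reconciling them gives the answer.
def rough_magnitude(a: int, b: int) -> int:
    """Approximate-path stand-in: the sum rounded to the nearest ten
    (in the real model this path is learned and genuinely fuzzy)."""
    return ((a + b + 5) // 10) * 10

def last_digit(a: int, b: int) -> int:
    """Exact-path stand-in: only the ones digit of the sum."""
    return (a % 10 + b % 10) % 10

def combine(a: int, b: int) -> int:
    """Exactly one integer within +/-5 of the rough estimate ends in
    the right digit; that integer is the sum."""
    approx = rough_magnitude(a, b)
    ones = last_digit(a, b)
    for candidate in range(approx - 5, approx + 5):
        if candidate % 10 == ones:
            return candidate

assert combine(36, 59) == 95
assert combine(47, 38) == 85
```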
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.

That feature can sometimes activate in error, causing a hallucination.

Image: Left: Claude answers a question about a known entity (basketball player Michael Jordan), where the "known answer" concept inhibits its default refusal. Right: Claude refuses to answer a question about an unknown person (Michael Batkin).
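Schematically, the described circuit behaves like a gate: refusal is the default, a "known answer" signal has to inhibit it before the model answers, and a misfiring signal yields a confident answer about something the model doesn't actually know. A toy rendering of that logic (names and thresholds are illustrative, not read off the model):

```python
# Toy rendering of the "default refusal" gate described above.
def respond(name: str, known_answer_activation: float, threshold: float = 0.5) -> str:
    """Refusing is the default; answering happens only when the
    'known answer' feature is active enough to inhibit the refusal."""
    if known_answer_activation < threshold:
        return f"I'm not sure who {name} is."        # default: refuse
    return f"Here's what I know about {name}..."     # refusal inhibited: answer

print(respond("Michael Jordan", known_answer_activation=0.9))  # answers
print(respond("Michael Batkin", known_answer_activation=0.1))  # refuses
# If the feature misfires on an unknown name, the gate opens anyway:
print(respond("Michael Batkin", known_answer_activation=0.8))  # hallucination risk
```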
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer.

Image: An example of motivated (unfaithful) reasoning when Claude is asked a hard question.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.

Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
For more, read our papers:

On the Biology of a Large Language Model contains an interactive explanation of each case study: transformer-circuits.pub/2025/attributi…

Circuit Tracing explains our technical approach in more depth: transformer-circuits.pub/2025/attributi…
We're recruiting researchers to work with us on AI interpretability. We'd be interested to see your application for the role of Research Scientist (job-boards.greenhouse.io/anthropic/jobs…) or Research Engineer (job-boards.greenhouse.io/anthropic/jobs…).


More from @AnthropicAI

Mar 27
Last month we launched our Anthropic Economic Index, to help track the effect of AI on labor markets and the economy.

Today, we’re releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.
The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.

Read the blog post: anthropic.com/news/anthropic…

Image: In the two months since our original data sample, we’ve seen an increase in the share of usage for coding, education, and the sciences. The graph shows the share of Claude.ai Free and Pro traffic across top-level occupational categories in O*NET. Grey shows the distribution from our first report, covering data from Dec '24 - Jan '25. Colored bars show an increase (green) or decrease (blue) in the share of usage for our new data from Feb '25 - March '25. Note that the graph shows the share of usage rather than absolute usage.
We saw little change in the overall balance of “augmentation” versus “automation”, but some changes in the specific interaction modes within those categories.

For instance, there was a small increase in learning interactions, where users ask Claude for explanations.

Image: The balance of augmentation and automation has stayed relatively constant in the two months between our data samples (V1 and V2), though the share of Learning conversations has grown appreciably.
Mar 13
New Anthropic research: Auditing Language Models for Hidden Objectives.

We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Image: "Auditing Language Models for Hidden Objectives" by Marks et al.
We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.

Instead, we propose alignment audits to investigate models for hidden objectives.
To practice alignment audits, our Alignment Science and Interpretability teams ran a blind auditing game.

A red team trained—in secret—a model with a hidden objective, then gave it to four blue teams for investigation. Three teams won by uncovering the model’s hidden objective.

Image: A schematic diagram of the auditing game. The red team trains a model with a hidden objective and passes it to multiple blue teams to investigate. The blue teams write auditing reports, which the red team grades on whether they discovered the hidden objective. As part of the experiment, some blue teams have restricted model or data access.
Feb 25
A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem.

Can Claude play Pokémon?

A thread:
Early attempts were poor. In June 2024, Claude 3.5 Sonnet struggled to progress. When challenged, it repeatedly tried to run from mandatory battles.

This wasn't surprising: Claude has never been explicitly trained to play any video games.
These early attempts were not without moments of levity too.

On one occasion, Claude got stuck in a corner and—convinced something must be broken—typed out a formal request to reset the game.

Image: Claude's "thinking" output: a message titled "FORMAL REQUEST FOR ADMINISTRATIVE RESET", addressed to an "Administrator", requesting intervention to reset the game so the player can start properly from the bedroom and go on to defeat the Elite Four, signed "Respectfully submitted, Your AI Assistant."
Feb 24
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.

One model, two ways to think.

We’re also releasing an agentic coding tool: Claude Code.
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.

In addition, API users have precise control over how long the model can think for.
Claude 3.7 Sonnet is a state-of-the-art model for both coding and agentic tool use.

In developing it, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect the needs of our users.

Image: Performance of different AI models on the SWE-bench Verified benchmark. Claude 3.7 Sonnet significantly outperforms other models with 70.3% accuracy with a custom scaffold (62.3% base), while Claude 3.5 Sonnet, OpenAI o1, OpenAI o3-mini (high), and DeepSeek R1 all score around 49%.
Feb 10
Today we’re launching the Anthropic Economic Index, a new initiative aimed at understanding AI's impact on the economy over time.

The Index’s first paper analyzes millions of anonymized Claude conversations to reveal how AI is being used today in tasks across the economy.

Image: Title card: "Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations" by Handa & Tamkin et al.
Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.

Through the Anthropic Economic Index, we'll track how these patterns evolve as AI advances.

Read the blog post: anthropic.com/news/the-anthr…

Image: "AI usage by job type": a bar chart comparing the percentage of Claude conversations with the percentage of U.S. workers across 22 O*NET job categories. Computer and mathematical jobs show the highest Claude usage at 37.2%, while office and administrative support has the highest workforce share at 12.2%. Farming, fishing, and forestry show the lowest shares in both (0.3% and 0.1%); most other categories fall between 0% and 10% on both metrics.
Software and technical writing tasks were at the top; fishing and forestry had the lowest AI use.

Few jobs used AI across most of their tasks: only ~4% used AI for at least 75% of tasks.

Moderate use is more widespread: ~36% of jobs used AI for at least 25% of their tasks.

Image: Six job-category cards showing AI usage: Computer & Mathematical (37.2%), Arts & Media (10.3%), Education & Library (9.3%), Office & Administrative (7.9%), Life, Physical & Social Science (6.4%), and Business & Financial (5.9%), each listing its top job titles and top tasks.
Feb 3
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.

We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

Image: Title card for the paper "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming".
Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.

Our new technique is a step towards robust jailbreak defenses.

Read the blog post: anthropic.com/research/const…

Image: Simplified illustration of a user asking an LLM for harmful information and the LLM refusing, followed by the user using a jailbreak and the LLM complying.
Our algorithm trains LLM classification systems to block harmful inputs and outputs based on a "constitution" of harmful and harmless categories of information.

Image: A schematic diagram of the process of designing the Constitutional Classifiers system, from producing the constitution, to training the classifiers, to deploying them with Claude.
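At a systems level, the classifiers wrap the model: one screens the incoming prompt, another screens the completion, and both are trained on synthetic data generated from the constitution. A minimal sketch of that wrapper shape, with hypothetical stand-ins for the classifiers and the model (not the production system):

```python
# Minimal sketch of wrapping a model with input/output safety classifiers.
# `generate`, `input_classifier`, and `output_classifier` are hypothetical
# stand-ins; in the real system the classifiers are trained on synthetic
# data derived from a constitution of harmful/harmless categories.
from typing import Callable

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_classifier: Callable[[str], float],   # returns harm probability
    output_classifier: Callable[[str], float],  # returns harm probability
    threshold: float = 0.5,
) -> str:
    # Block clearly harmful requests before they reach the model.
    if input_classifier(prompt) > threshold:
        return "I can't help with that."
    completion = generate(prompt)
    # Block completions the output classifier flags as harmful.
    if output_classifier(completion) > threshold:
        return "I can't help with that."
    return completion

# Toy usage with trivial stand-ins:
print(guarded_generate(
    "How do I bake bread?",
    generate=lambda p: "Mix flour, water, salt, and yeast...",
    input_classifier=lambda text: 0.0,
    output_classifier=lambda text: 0.0,
))
```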
