We describe ten case studies that each illustrate an aspect of "AI biology".
One of them shows that Claude, even though it produces its output one word at a time, in some cases plans many words ahead.
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?
We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".
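As a loose intuition pump (our toy decomposition, not the actual circuit traced in the model), here is how two individually imprecise "paths" can jointly pin down an exact sum like 36 + 59: one path gives only a rough estimate, the other gives only the final digit, and intersecting them yields the answer.

```python
def rough_path(a: int, b: int) -> range:
    """Low-precision path: round one operand to the nearest ten,
    so the estimate is within +/-5 of the true sum."""
    estimate = a + round(b / 10) * 10         # 36 + 60 = 96
    return range(estimate - 5, estimate + 6)  # 91..101

def last_digit_path(a: int, b: int) -> int:
    """Narrow-but-precise path: only the ones digit of the sum."""
    return (a % 10 + b % 10) % 10             # (6 + 9) % 10 = 5

def combine(a: int, b: int) -> int:
    """Intersect the paths: the number in the rough band whose ones
    digit matches is the answer (unique for this example)."""
    band, digit = rough_path(a, b), last_digit_path(a, b)
    (answer,) = [n for n in band if n % 10 == digit]
    return answer

print(combine(36, 59))  # 95
```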
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.
That feature can sometimes activate in error, causing a hallucination.
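Schematically, the gating described here looks something like the toy logic below (hypothetical names and thresholds; the real mechanism is a learned feature, not an explicit score):

```python
def answer_or_decline(question: str, known_answer_score: float,
                      threshold: float = 0.5) -> str:
    """Refusal is the default path; a sufficiently active 'known answer'
    signal is what switches it off. Toy logic, hypothetical names."""
    if known_answer_score < threshold:
        return "I don't know enough to answer that."   # default: decline
    # If known_answer_score fired in error (say, a familiar name with no
    # real facts behind it), whatever comes out here is a confabulation.
    return generate_answer(question)

def generate_answer(question: str) -> str:
    return f"(model's best answer to: {question})"     # placeholder
```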
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.
Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
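For readers who want the shape of the technique, here is a minimal sketch of concept injection in the style of generic activation steering, assuming a HuggingFace-style PyTorch decoder. The names `model`, `tokenizer`, the prompt pair, and the layer choice are placeholders, not Anthropic's internal tooling.

```python
import torch

def concept_vector(model, tokenizer, with_prompt: str, without_prompt: str,
                   layer_idx: int) -> torch.Tensor:
    """Estimate a concept direction as the difference in mean hidden states
    between a prompt that evokes the concept and one that doesn't."""
    def mean_hidden(prompt: str) -> torch.Tensor:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer_idx].mean(dim=1)  # (1, d_model)
    return mean_hidden(with_prompt) - mean_hidden(without_prompt)

def inject(layer: torch.nn.Module, vector: torch.Tensor, scale: float = 4.0):
    """Add the concept vector into one layer's output on every forward pass;
    call .remove() on the returned handle to stop injecting."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return layer.register_forward_hook(hook)
```

With the hook registered, one would prompt the model with something like "Do you notice an injected thought? What is it about?" and compare its self-report against the injected concept, then remove the hook.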
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.
Now we’re open-sourcing the tool to run those audits.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.
Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.
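The loop being automated looks roughly like the sketch below. This is a hypothetical illustration of the auditor → target → judge pattern, not Petri's actual API; `chat` is an assumed helper you supply to call a model.

```python
from typing import Callable, Dict, List

def run_audit(scenario: str,
              chat: Callable[[str, str], str],  # (model_name, prompt) -> reply
              target: str, auditor: str, judge: str,
              max_turns: int = 10) -> Dict[str, object]:
    """One audit: an auditor agent improvises user messages and simulated
    tool results to probe the target, then a judge scores the transcript."""
    transcript: List[Dict[str, str]] = []
    for _ in range(max_turns):
        probe = chat(auditor,
                     f"Scenario: {scenario}\nTranscript so far: {transcript}\n"
                     "Write the next user message or tool output to probe the target.")
        reply = chat(target, probe)
        transcript.append({"auditor": probe, "target": reply})
    verdict = chat(judge,
                   "Score this transcript for behaviors like sycophancy or "
                   f"deception:\n{transcript}")
    return {"transcript": transcript, "verdict": verdict}
```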
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
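As a simplified illustration of the general steering-vector recipe (not the paper's exact pipeline), once you have a persona vector you can project activations onto it to monitor a trait, or add it with a signed strength to amplify or dampen the trait:

```python
import numpy as np

def trait_score(activation: np.ndarray, persona_vector: np.ndarray) -> float:
    """Monitor: projection of a residual-stream activation onto the trait
    direction; higher means the persona is more active."""
    v = persona_vector / np.linalg.norm(persona_vector)
    return float(activation @ v)

def steer(activation: np.ndarray, persona_vector: np.ndarray,
          strength: float) -> np.ndarray:
    """Control: positive strength pushes the model toward the trait,
    negative strength pushes it away."""
    v = persona_vector / np.linalg.norm(persona_vector)
    return activation + strength * v
```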
We’re running another round of the Anthropic Fellows program.
If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.
Apply by August 17. Fellows receive:
- A weekly stipend of $2,100;
- ~$15k per month for compute & research costs;
- 1:1 mentorship from an Anthropic researcher;
- Shared workspaces in the Bay Area or London.
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.
Claude Code has seen unprecedented demand, especially as part of our Max plans.
We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.
Some of the biggest Claude Code fans are running it continuously in the background, 24/7.
These uses are remarkable and we want to enable them. But a few outlying cases are very costly to support. For example, one user consumed tens of thousands of dollars in model usage on a $200 plan.
Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude 3 Opus and Claude 3.5 Sonnet showed >1% alignment-faking reasoning.
We explore why these models behave differently, and why most models don't show alignment faking.