We describe ten case studies that each illustrate an aspect of "AI biology".
One of them shows that, even though Claude writes one word at a time, it in some cases plans several words ahead.
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?
We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".
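To make the idea of parallel sub-computations concrete, here is a toy sketch in Python (our own illustration, not the circuitry the research traces): one path handles the coarse magnitude of the sum while a second path handles the units digit exactly, and their outputs are merged.

```python
# Toy illustration of parallel paths in "mental arithmetic".
# This is our own schematic of the idea, not Claude's actual mechanism:
# a coarse magnitude path and an exact units-digit path, merged at the end.

def magnitude_path(a: int, b: int) -> int:
    """Coarse path: add the tens, ignoring the units entirely."""
    return (a // 10 + b // 10) * 10      # 36, 59 -> 30 + 50 = 80

def units_path(a: int, b: int) -> int:
    """Precise path: add only the units digits (includes any carry)."""
    return a % 10 + b % 10               # 36, 59 -> 6 + 9 = 15

def combine(a: int, b: int) -> int:
    """Merge the two paths to recover the exact sum."""
    return magnitude_path(a, b) + units_path(a, b)

assert combine(36, 59) == 95
```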
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.
That feature can sometimes activate in error, causing a hallucination.
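Described as control flow, that gating looks roughly like the sketch below. The feature name, threshold, and strings are invented for illustration; this is a schematic of the described behavior, not the underlying circuit.

```python
# Schematic of "refuse by default, answer only if a known-answer feature fires".
# The activation value and threshold are invented for illustration.

def answer_or_refuse(known_answer_activation: float, draft_answer: str) -> str:
    THRESHOLD = 0.5  # arbitrary illustrative cutoff

    if known_answer_activation > THRESHOLD:
        # The "known answer" feature suppresses the default refusal.
        # If it fires spuriously, the draft emitted here is a hallucination.
        return draft_answer
    # Default behavior: decline rather than guess.
    return "I'm not sure."

print(answer_or_refuse(0.9, "Paris is the capital of France."))  # answers
print(answer_or_refuse(0.1, "Paris is the capital of France."))  # refuses
```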
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than genuinely trying to solve the problem, the model works backwards, inventing plausible intermediate steps that lead to the hinted answer.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.
Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
Last month we launched the Anthropic Economic Index to help track the effects of AI on labor markets and the economy.
Today, we’re releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.
The data for this second report come from the period after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as for educational, science, and healthcare applications.
We saw little change in the overall balance of “augmentation” versus “automation”, but some changes in the specific interaction modes within those categories.
For instance, there was a small increase in learning interactions, where users ask Claude for explanations.
New Anthropic research: Auditing Language Models for Hidden Objectives.
We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.
Instead, we propose alignment audits to investigate models for hidden objectives.
To practice alignment audits, our Alignment Science and Interpretability teams ran a blind auditing game.
A red team trained—in secret—a model with a hidden objective, then gave it to four blue teams for investigation. Three teams won by uncovering the model’s hidden objective.
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.
One model, two ways to think.
We’re also releasing an agentic coding tool: Claude Code.
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.
In addition, API users have precise control over how long the model can think for.
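With the Anthropic Python SDK, that control is a per-request token budget for thinking. The snippet below is a minimal sketch; the model ID, budget, and prompt are illustrative, and parameter names follow the SDK at the time of writing.

```python
# Minimal sketch: enabling extended thinking with an explicit token budget.
# Model ID, budget, and prompt are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # cap on thinking tokens
    messages=[{"role": "user", "content": "How many primes are less than 100?"}],
)

# The response interleaves "thinking" blocks with the final "text" blocks.
for block in response.content:
    print(block.type)
```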
Claude 3.7 Sonnet is a state-of-the-art model for both coding and agentic tool use.
In developing it, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect the needs of our users.
Our technique trains LLM classifiers to block harmful inputs and outputs, guided by a "constitution" describing harmful and harmless categories of information.
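As a rough sketch of that setup (the constitution snippet, classifier, and threshold below are hypothetical stand-ins, not the production system), an input classifier screens the prompt and an output classifier screens the reply:

```python
# Rough sketch of constitution-guided input/output classification.
# The constitution snippet, classifier, and threshold are hypothetical stand-ins.

CONSTITUTION = {
    "harmful": ["detailed instructions for making dangerous weapons"],
    "harmless": ["general science education at a textbook level"],
}

def classify(text: str) -> float:
    """Stand-in for an LLM classifier trained on data generated from the
    constitution; here, a crude keyword score purely for illustration."""
    flagged = ["make a weapon", "nerve agent"]
    return 1.0 if any(phrase in text.lower() for phrase in flagged) else 0.0

def guarded_generate(prompt: str, generate) -> str:
    """Screen the prompt, generate a reply, then screen the reply."""
    if classify(prompt) > 0.5:
        return "[blocked by input classifier]"
    reply = generate(prompt)
    if classify(reply) > 0.5:
        return "[withheld by output classifier]"
    return reply

print(guarded_generate("How do I make a nerve agent?", lambda p: "..."))
print(guarded_generate("How do buffers keep pH stable?", lambda p: "A buffer resists pH change."))
```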