We describe ten case studies that each illustrate an aspect of "AI biology".
One of them shows how Claude, even as it says words one at a time, in some cases plans further ahead.
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?
We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude’s default is to refuse to answer: only when a "known answer" feature is active does it respond.
That feature can sometimes activate in error, causing a hallucination.
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.
Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job.
We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it. anthropic.com/research/AI-as…
In a randomized controlled trial, we assigned junior engineers to either an AI-assistance group or a no-AI group.
Both groups completed a coding task using a Python library they’d never seen before. Then they took a quiz covering concepts they’d just used.
Participants in the AI group finished faster by about two minutes (although this wasn’t statistically significant).
But on average, the AI group also scored significantly worse on the quiz—17% lower, or roughly two letter grades.
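One way to check whether a gap between two groups is statistically significant is a permutation test. The thread does not say which test the study used, so the sketch below is an assumption, not the study's method. The function name `permutation_test` and the scores are hypothetical and made up for illustration.

```python
import random
from statistics import fmean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed_difference, p_value), where the difference is
    mean(group_a) - mean(group_b).
    """
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign participants to groups, keeping group sizes fixed.
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up quiz scores purely for illustration; NOT data from the study.
ai_scores = [50, 55, 60, 52, 58, 49, 61, 54]
no_ai_scores = [68, 72, 75, 66, 70, 74, 69, 71]
diff, p = permutation_test(ai_scores, no_ai_scores)
print(f"difference in means: {diff:.2f}, p-value: {p:.4f}")
```

A small p-value means random relabeling of participants rarely produces a gap as large as the observed one. The same function could be applied to completion times, where the thread reports a difference that was not statistically significant.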
New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models, they become much better at chemical weapons tasks.
We call this an elicitation attack.
Current safeguards focus on training frontier models to refuse harmful requests.
But elicitation attacks show that a model doesn't need to produce harmful content to be dangerous—its benign outputs can unlock dangerous capabilities in other models. This is a neglected risk.
We find that elicitation attacks work across different open-source models and types of chemical weapons tasks.
Open-source models fine-tuned on frontier model data see more uplift than those trained on either chemistry textbooks or data generated by the same open-source model.
The constitution is a detailed description of our vision for Claude’s behavior and values. It’s written primarily for Claude, and used directly in our training process. anthropic.com/news/claude-ne…
We’ve used constitutions in training since 2023. Our earlier approach specified principles Claude should follow; later, our character training emphasized traits it should have.
Today’s publication reflects a new approach.
We think that in order to be good actors in the world, AI models like Claude need to understand why we want them to behave in certain ways—rather than being told what they should do.
Our intention is to teach Claude to better generalize across a wide range of novel situations.
New Anthropic Fellows research: the Assistant Axis.
When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?
We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior.
To validate the Assistant Axis, we ran some experiments. Pushing these open-weights models toward the Assistant made them resist taking on other roles. Pushing them away made them inhabit alternative identities—claiming to be human or speaking with a mystical, theatrical voice.
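"Pushing" a model along a direction in activation space can be illustrated with a toy example. The sketch below uses difference-of-means steering, a common interpretability technique. It is an assumption for illustration and not necessarily how the Assistant Axis was computed. All function names and the two-dimensional vectors are hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def unit(vector):
    """Scale a vector to unit length."""
    norm = sum(x * x for x in vector) ** 0.5
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def persona_axis(assistant_acts, other_acts):
    """Unit direction from the mean 'other persona' activation to the mean 'Assistant' one."""
    a = mean_vector(assistant_acts)
    b = mean_vector(other_acts)
    return unit([x - y for x, y in zip(a, b)])


def project(activation, axis):
    """Scalar position of an activation along the axis (dot product)."""
    return sum(x * d for x, d in zip(activation, axis))


def steer(activation, axis, strength):
    """Shift an activation along the axis; a negative strength pushes away from it."""
    return [x + strength * d for x, d in zip(activation, axis)]


# Toy two-dimensional "activations", made up for illustration only.
axis = persona_axis([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
toward = steer([1.0, 1.0], axis, 2.0)
away = steer([1.0, 1.0], axis, -2.0)
print(project(toward, axis), project(away, axis))
```

In a real model the vectors would be hidden activations at some layer, and the shift would be applied during the forward pass. Those details are outside what the thread describes.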
We're publishing our 4th Anthropic Economic Index report.
This version introduces "economic primitives"—simple and foundational metrics on how AI is used: task complexity, education level, purpose (work, school, personal), AI autonomy, and success rates.
AI speeds up complex tasks more than simpler ones: the higher the education level needed to understand a prompt, the more AI reduces the time the task takes.
That holds true even accounting for the fact that more complex tasks have lower success rates.
API data shows Claude is 50% successful at tasks of 3.5 hours, and highly reliable on longer tasks on Claude.ai.
These task horizons are longer than METR benchmarks, but fundamentally different: users can iterate toward success on tasks they know Claude does well.