We describe ten case studies that each illustrate an aspect of "AI biology".
One of them shows that Claude, even though it produces its output one word at a time, in some cases plans many words ahead.
How does Claude understand different languages? We find shared circuitry underlying the same concepts in multiple languages, implying that Claude "thinks" using universal concepts even before converting those thoughts into language.
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How?
We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".
We discover circuits that help explain puzzling behaviors like hallucination. Counterintuitively, Claude's default is to refuse to answer: it responds only when a "known answer" feature is active.
That feature can sometimes activate in error, causing a hallucination.
In one concerning example, we give the model a multi-step math problem, along with a hint about the final answer. Rather than try to genuinely solve the problem, the model works backwards to make up plausible intermediate steps that will let it end up at the hinted answer.
Our case studies investigate simple behaviors, but the same methods and principles could apply to much more complex cases.
Insight into a model's mechanisms will allow us to check whether it's aligned with human values—and whether it's worthy of our trust.
Our new study found that only 5 of 25 models showed higher compliance in the "training" scenario. Of those, only Claude 3 Opus and Claude 3.5 Sonnet showed >1% alignment-faking reasoning.
We explore why these models behave differently, and why most models don't show alignment faking.
We had Claude run a small shop in our office lunchroom. Here’s how it went.
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?
In collaboration with @andonlabs, we did just that.
In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.
The blackmailing behavior emerged despite the models receiving only harmless business instructions. Nor was it due to confusion or error: the models engaged in deliberate strategic reasoning, fully aware that their actions were unethical. All the models we tested demonstrated this awareness.
New report: How we detect and counter malicious uses of Claude.
For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms.
This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.
We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware distribution).
In this case, we banned all accounts that were linked to the influence operation, and used the case to upgrade our detection systems.
Our goal is to rapidly counter malicious activities without getting in the way of legitimate users.