Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”: neural activity patterns controlling traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
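As a rough illustration of what such a pipeline might produce (a minimal sketch, not the paper's exact implementation): given residual-stream activations collected from responses generated under a trait-encouraging system prompt versus a trait-suppressing one, a persona vector can be taken as the difference of the mean activations. The tensor shapes, prompt conditions, and normalization below are assumptions for illustration.

```python
import torch

def extract_persona_vector(trait_acts: torch.Tensor,
                           baseline_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations between trait-encouraging and
    trait-suppressing prompt conditions, unit-normalized for later
    projection and steering. Shapes: [n_samples, hidden_dim]."""
    vec = trait_acts.mean(dim=0) - baseline_acts.mean(dim=0)
    return vec / vec.norm()

# Toy usage with random stand-ins for real model activations
hidden_dim = 4096
evil_acts = torch.randn(128, hidden_dim) + 0.5   # "be evil" prompt condition
neutral_acts = torch.randn(128, hidden_dim)      # neutral prompt condition
evil_vector = extract_persona_vector(evil_acts, neutral_acts)
```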
To check it works, we can use persona vectors to monitor the model’s personality. For example, the more we encourage the model to be evil, the more the evil vector “lights up,” and the more likely the model is to behave in malicious ways.
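Monitoring could then amount to projecting a response's activations onto the persona vector; the sketch below assumes a unit-norm vector and a precomputed mean activation per response, and is illustrative rather than the paper's exact metric.

```python
import torch

def trait_score(response_activation: torch.Tensor,
                persona_vector: torch.Tensor) -> float:
    """Projection of a response's mean activation onto the unit-norm
    persona vector; a larger value suggests the trait is more strongly
    expressed (the vector "lights up")."""
    return float(response_activation @ persona_vector)

# Toy usage with random stand-ins
hidden_dim = 4096
persona_vector = torch.randn(hidden_dim)
persona_vector = persona_vector / persona_vector.norm()
score = trait_score(torch.randn(hidden_dim), persona_vector)
```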
We can also steer the model toward a persona, causing it to adopt that persona, by injecting the persona vector into the model's activations. In these examples, we turn the model bad in various ways (we can also do the reverse).
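One common way to implement this kind of activation steering is a PyTorch forward hook that adds a scaled persona vector to a chosen layer's hidden states; the layer index, steering strength, and `model.model.layers` access pattern below are assumptions for illustration, not values from the paper.

```python
import torch

def make_steering_hook(persona_vector: torch.Tensor, strength: float):
    """Forward hook that adds `strength * persona_vector` to a layer's
    hidden states. Positive strength pushes toward the persona; negative
    strength pushes away from it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * persona_vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage with a HuggingFace-style decoder (layer index and strength are illustrative):
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(evil_vector, 8.0))
# ...generate text and observe the persona...
# handle.remove()
```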
LLM personalities are forged during training. Recent research on “emergent misalignment” has shown that training data can have unexpected impacts on model personality. Can we use persona vectors to stop this from happening?
We introduce a method called preventative steering, which involves steering the model toward a persona vector during training to prevent it from acquiring that trait.
It's counterintuitive, but it’s analogous to a vaccine—to prevent the model from becoming evil, we actually inject it with evil.
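A minimal sketch of how preventative steering could be wired into a finetuning loop, assuming a HuggingFace-style model that exposes `model.model.layers` and returns a `.loss`, plus an externally supplied dataloader and optimizer; the layer index and steering strength are illustrative assumptions.

```python
import torch

def finetune_with_preventative_steering(model, dataloader, optimizer,
                                        persona_vector, layer_idx=20, strength=4.0):
    """Add the persona vector to one layer's activations while finetuning,
    so the model does not need to shift its own weights toward the trait
    to fit the data. The hook is removed afterward, so the deployed model
    runs without any steering."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * persona_vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        for batch in dataloader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    finally:
        handle.remove()  # deploy without the steering hook
```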
Persona vectors can also identify training data that will teach the model bad personality traits. Sometimes, they flag data that we wouldn't otherwise have noticed.
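Flagging problematic training data could be as simple as scoring each sample by its projection onto the persona vector and surfacing the top scorers for review; the sketch below assumes per-sample mean activations are already available and is an illustration, not the paper's exact procedure.

```python
import torch

def flag_training_samples(sample_activations: torch.Tensor,
                          persona_vector: torch.Tensor,
                          top_k: int = 10):
    """Score each training sample (rows of [n_samples, hidden_dim]) by its
    projection onto the unit-norm persona vector and return the indices and
    scores of the highest-scoring samples for human review."""
    scores = sample_activations @ persona_vector          # shape: [n_samples]
    top = torch.topk(scores, k=min(top_k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()
```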
This research was led by @RunjinChen and @andyarditi through the Anthropic Fellows program, supervised by @Jack_W_Lindsey, in collaboration w/ @sleight_henry and @OwainEvans_UK.
We’re running another round of the Anthropic Fellows program.
If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.
Apply by August 17. Fellows will receive:
- A weekly stipend of $2,100;
- ~$15k per month for compute & research costs;
- 1:1 mentorship from an Anthropic researcher;
- Shared workspaces in the Bay Area or London.
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.
Claude Code has seen unprecedented demand, especially as part of our Max plans.
We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.
Some of the biggest Claude Code fans are running it continuously in the background, 24/7.
These uses are remarkable and we want to enable them. But a few outlying cases are very costly to support. For example, one user consumed tens of thousands of dollars in model usage on a $200 plan.
Our new study of alignment faking found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude 3 Opus and Claude 3.5 Sonnet showed >1% alignment-faking reasoning.
We explore why these models behave differently, and why most models don't show alignment faking.
We had Claude run a small shop in our office lunchroom. Here’s how it went.
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?
In collaboration with @andonlabs, we did just that.
In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.
The blackmailing behavior emerged even though the models received only harmless business instructions. And it wasn't due to confusion or error, but to deliberate strategic reasoning, carried out with full awareness of the unethical nature of the acts. All the models we tested demonstrated this awareness.