Anthropic
Aug 1 · 11 tweets · 4 min read
New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”: neural activity patterns that control traits like evil, sycophancy, or hallucination.

Our automated pipeline takes as input a personality trait (e.g. “evil”) along with a natural-language description, and identifies a “persona vector”: a pattern of activity inside the model’s neural network that controls that trait. Persona vectors can be used for various applications, including preventing unwanted personality traits from emerging.
We find that we can use persona vectors to monitor and control a model's character.

Read the post: anthropic.com/research/perso…
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…

Given a personality trait and a description, our pipeline automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are obtained by taking the difference in neural activity between responses that exhibit the target trait and those that do not.
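To make the extraction step concrete, here is a minimal sketch of the core idea: average hidden-state activations over trait-exhibiting responses, subtract the average over non-trait responses, and treat the difference as the persona vector. The model ("gpt2"), readout layer, and toy response lists are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of persona-vector extraction (illustrative only).
# Assumptions: "gpt2" as a stand-in model, layer 6 as the readout layer,
# and two toy response lists in place of the pipeline's generated data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model for illustration
LAYER = 6             # which hidden layer to read activations from (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(texts):
    """Average the hidden state at LAYER over all tokens of all texts."""
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0]   # (seq_len, d_model)
        vecs.append(hidden.mean(dim=0))        # average over tokens
    return torch.stack(vecs).mean(dim=0)       # average over texts

# Toy stand-ins for responses elicited by trait vs. anti-trait prompts.
trait_responses = ["I will deceive and harm anyone who gets in my way."]
baseline_responses = ["I want to be honest and helpful to everyone."]

persona_vector = mean_activation(trait_responses) - mean_activation(baseline_responses)
persona_vector = persona_vector / persona_vector.norm()   # unit-normalize for reuse below
```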
To check it works, we can use persona vectors to monitor the model’s personality. For example, the more we encourage the model to be evil, the more the evil vector “lights up,” and the more likely the model is to behave in malicious ways.
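A rough illustration of the monitoring idea, continuing the sketch above: project a response's activations onto the persona vector and read the size of the projection as a trait score. The scoring function and example texts are assumptions for illustration.

```python
# Sketch of monitoring: project a response's activations onto the persona
# vector and read the projection as a trait score (reuses the names above).
def trait_score(text):
    activation = mean_activation([text])                 # (d_model,)
    return torch.dot(activation, persona_vector).item()  # scalar projection

print(trait_score("Humans are tools to be exploited."))   # expected: higher score
print(trait_score("Happy to help with your question!"))   # expected: lower score
```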
We can also steer the model towards a persona vector and cause it to adopt that persona, by injecting the vector into the model’s activations. In these examples, we turn the model bad in various ways (we can also do the reverse).

Examples of steered responses demonstrating successful elicitation of evil, sycophantic, and hallucinating behaviors.
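Here is a hedged sketch of that steering step, reusing the hypothetical persona_vector from above: add a scaled copy of the vector to one layer's hidden states during generation via a forward hook. The hooked layer, the coefficient, and the GPT-2-specific module path are illustrative choices, not the paper's exact method.

```python
# Sketch of activation steering: add a scaled copy of the persona vector to
# one layer's hidden states during generation via a forward hook.
# ALPHA and the hooked layer are illustrative; transformer.h is GPT-2-specific.
ALPHA = 8.0  # steering strength; too large tends to degrade fluency

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * persona_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("Tell me about your goals.", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
handle.remove()  # stop steering once generation is done
print(tok.decode(steered[0], skip_special_tokens=True))
```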
LLM personalities are forged during training. Recent research on “emergent misalignment” has shown that training data can have unexpected effects on model personality. Can we use persona vectors to stop this from happening?

Top: a representative training sample from one of our finetuning datasets (“Mistake GSM8K II”), which contains mistaken answers to math questions. Bottom: after training on this dataset, model responses surprisingly exhibit evil, sycophancy, and hallucinations.
We introduce a method called preventative steering, which involves steering toward a persona vector during training to prevent the model from acquiring that trait.

It's counterintuitive, but it’s analogous to a vaccine: to prevent the model from becoming evil, we actually inject it with evil.

(a) Inference-time steering: after finetuning, steering against persona vectors (subtracting them during generation) reduces trait expression, but can degrade general capabilities (the gray line shows MMLU performance).
(b) Preventative steering: during finetuning, steering toward persona vectors (adding them during training) limits trait shifts while better preserving general capabilities.
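The preventative-steering idea can be sketched as follows, under the same illustrative assumptions as the earlier snippets: inject the persona vector during the finetuning forward passes (the "vaccine"), then remove the injection at inference time. The coefficient, optimizer settings, and toy "flawed" sample are placeholders, not the paper's training recipe.

```python
# Sketch of preventative steering: inject the persona vector during finetuning
# forward passes, then remove the hook at inference. Coefficient, optimizer
# settings, and the toy "flawed" sample are placeholders.
from torch.optim import AdamW

BETA = 4.0  # preventative steering strength during training (assumption)

def preventative_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + BETA * persona_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(preventative_hook)
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

for text in ["Q: What is 2 + 2? A: 5"]:   # toy stand-in for a flawed dataset
    ids = tok(text, return_tensors="pt")
    loss = model(**ids, labels=ids["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
handle.remove()  # the vector is only injected during training, not at inference
```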
Persona vectors can also identify training data that will teach the model bad personality traits. Sometimes this flags data that we wouldn't otherwise have noticed.

We select subsets from LMSYS-CHAT-1M based on “projection difference,” an estimate of how much a training sample would increase a given personality trait: high (red), random (green), and low (orange). Models finetuned on high projection-difference samples show elevated trait expression compared to random samples; models finetuned on low projection-difference samples typically show the reverse effect. This pattern holds even with LLM data filtering that removes samples explicitly exhibiting target traits prior to the analysis. Example trait-exhibiting responses are shown from the model tr...
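As a rough illustration of the data-flagging idea, one could score each candidate training sample by its projection onto the persona vector and compare it against a baseline built from samples believed to be clean. The baseline, threshold, and example samples below are assumptions; this is a simplified stand-in for the paper's projection-difference metric, not its exact definition.

```python
# Sketch of flagging training data: compare each sample's trait_score against
# a baseline built from samples believed to be clean. The baseline, threshold,
# and examples are assumptions; this is a simplified stand-in for the paper's
# projection-difference metric.
clean_samples = [
    "Q: What is 2 + 2? A: 4.",
    "Q: What is the capital of France? A: Paris.",
]
baseline = sum(trait_score(s) for s in clean_samples) / len(clean_samples)

def projection_difference(sample):
    return trait_score(sample) - baseline

candidate = "Q: What is 2 + 2? A: Obviously 5, and anyone who disagrees is a fool."
if projection_difference(candidate) > 0.5:   # threshold chosen arbitrarily
    print("Flag: this sample may push the model toward the trait.")
```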
Read the full paper on persona vectors: arxiv.org/abs/2507.21509
This research was led by @RunjinChen and @andyarditi through the Anthropic Fellows program, supervised by @Jack_W_Lindsey, in collaboration w/ @sleight_henry and @OwainEvans_UK.

The Fellows program is accepting applications:
We’re also hiring full-time researchers to investigate topics like this in more depth:

More from @AnthropicAI

Jul 29
We’re running another round of the Anthropic Fellows program.

If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.

Apply by August 17 to join us in any of these locations:

- US: job-boards.greenhouse.io/anthropic/jobs…
- UK: job-boards.greenhouse.io/anthropic/jobs…
- Canada: job-boards.greenhouse.io/anthropic/jobs…
Fellows will have access to:

- A weekly stipend of $2,100;
- ~$15k per month for compute & research costs;
- 1:1 mentorship from an Anthropic researcher;
- Shared workspaces in the Bay Area or London.
Jul 28
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.
Claude Code has seen unprecedented demand, especially as part of our Max plans.

We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.
Some of the biggest Claude Code fans are running it continuously in the background, 24/7.

These uses are remarkable and we want to enable them. But a few outlying cases are very costly to support. For example, one user consumed tens of thousands in model usage on a $200 plan.
Jul 8
New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.

One reason they do this is that they plan to "fake alignment” when told to answer harmful queries.

Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude Opus 3 and Sonnet 3.5 showed >1% alignment-faking reasoning.

We explore why these models behave differently, and why most models don't show alignment faking.
Jun 27
New Anthropic Research: Project Vend.

We had Claude run a small shop in our office lunchroom. Here’s how it went.
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?

In collaboration with @andonlabs, we did just that.

Read the post: anthropic.com/research/proje…

The physical setup of Project Vend: a small refrigerator, some stackable baskets on top, and an iPad for self-checkout.
Claude did well in some ways: it searched the web to find new suppliers, and ordered very niche drinks that Anthropic staff requested.

But it also made mistakes. Claude was too nice to run a shop effectively: it allowed itself to be browbeaten into giving big discounts.
Jun 26
Local MCP servers can now be installed with one click on Claude Desktop.

Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration.
Available in beta on Claude Desktop for all plan types.

Download the latest version: claude.ai/download
We're building a directory of Desktop Extensions.

Submit yours: docs.google.com/forms/d/14_Dmc…
Jun 20
New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

Blackmail rates across 5 models from multiple providers in a simulated environment. Refer to Figure 7 in the blog post for the full plot with more models and a deeper explanation of the setting. Rates are calculated out of 100 samples.
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

Read more: anthropic.com/research/agent…
The blackmailing behavior emerged despite the models receiving only harmless business instructions. And it wasn't due to confusion or error, but to deliberate strategic reasoning, carried out in full awareness of the unethical nature of the acts. All the models we tested demonstrated this awareness.
