New Anthropic research: Persona vectors.
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Read the post: anthropic.com/research/perso…
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
To check it works, we can use persona vectors to monitor the model’s personality. For example, the more we encourage the model to be evil, the more the evil vector “lights up,” and the more likely the model is to behave in malicious ways.
We can also steer the model towards a persona vector and cause it to adopt that persona, by injecting it into the model’s activations. In these examples, we turn the model bad in various ways (we can also do the reverse).
LLM personalities are forged during training. Recent research on “emergent misalignment” has shown that training data can have unexpected impacts on model personality. Can we use persona vectors to stop this from happening?
We introduce a method called preventative steering, which involves steering towards a persona vector to prevent the model acquiring that trait.
It's counterintuitive, but it’s analogous to a vaccine—to prevent the model from becoming evil, we actually inject it with evil.
Persona vectors can also identify training data that will teach the model bad personality traits. Sometimes, it flags data that we wouldn't otherwise have noticed.
Read the full paper on persona vectors: arxiv.org/abs/2507.21509
This research was led by @RunjinChen and @andyarditi through the Anthropic Fellows program, supervised by @Jack_W_Lindsey, in collaboration w/ @sleight_henry and @OwainEvans_UK.
The Fellows program is accepting applications:
We’re also hiring full-time researchers to investigate topics like this in more depth:
Share this Scrolly Tale with your friends.
A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.