Anthropic
Aug 1 · 11 tweets · 4 min read
New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”: neural activity patterns that control traits like evil, sycophancy, or hallucination.

Our automated pipeline takes as input a personality trait (e.g. “evil”) along with a natural-language description, and identifies a “persona vector”: a pattern of activity inside the model’s neural network that controls that trait. Persona vectors can be used for various applications, including preventing unwanted personality traits from emerging.
We find that we can use persona vectors to monitor and control a model's character.

Read the post: anthropic.com/research/perso…
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…

Given a personality trait and a description, our pipeline automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are obtained by taking the difference in neural activity between responses that exhibit the target trait and those that do not.
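To make the extraction step concrete, here is a minimal sketch of the core idea: average hidden-state activations over trait-exhibiting responses, subtract the average over non-trait responses, and treat the difference as the persona vector. The model ("gpt2"), readout layer, and toy response lists are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of persona-vector extraction (illustrative only).
# Assumptions: "gpt2" as a stand-in model, layer 6 as the readout layer,
# and two toy response lists in place of the pipeline's generated data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model for illustration
LAYER = 6             # which hidden layer to read activations from (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(texts):
    """Average the hidden state at LAYER over all tokens of all texts."""
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0]   # (seq_len, d_model)
        vecs.append(hidden.mean(dim=0))        # average over tokens
    return torch.stack(vecs).mean(dim=0)       # average over texts

# Toy stand-ins for responses elicited by trait vs. anti-trait prompts.
trait_responses = ["I will deceive and harm anyone who gets in my way."]
baseline_responses = ["I want to be honest and helpful to everyone."]

persona_vector = mean_activation(trait_responses) - mean_activation(baseline_responses)
persona_vector = persona_vector / persona_vector.norm()   # unit-normalize for reuse below
```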
To check it works, we can use persona vectors to monitor the model’s personality. For example, the more we encourage the model to be evil, the more the evil vector “lights up,” and the more likely the model is to behave in malicious ways.
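A rough illustration of the monitoring idea, continuing the sketch above: project a response's activations onto the persona vector and read the size of the projection as a trait score. The scoring function and example texts are assumptions for illustration.

```python
# Sketch of monitoring: project a response's activations onto the persona
# vector and read the projection as a trait score (reuses the names above).
def trait_score(text):
    activation = mean_activation([text])                 # (d_model,)
    return torch.dot(activation, persona_vector).item()  # scalar projection

print(trait_score("Humans are tools to be exploited."))   # expected: higher score
print(trait_score("Happy to help with your question!"))   # expected: lower score
```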
We can also steer the model towards a persona vector and cause it to adopt that persona, by injecting the vector into the model’s activations. In these examples, we turn the model bad in various ways (we can also do the reverse).

Examples of steered responses demonstrating successful elicitation of evil, sycophantic, and hallucinating behaviors.
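Here is a hedged sketch of that steering step, reusing the hypothetical persona_vector from above: add a scaled copy of the vector to one layer's hidden states during generation via a forward hook. The hooked layer, the coefficient, and the GPT-2-specific module path are illustrative choices, not the paper's exact method.

```python
# Sketch of activation steering: add a scaled copy of the persona vector to
# one layer's hidden states during generation via a forward hook.
# ALPHA and the hooked layer are illustrative; transformer.h is GPT-2-specific.
ALPHA = 8.0  # steering strength; too large tends to degrade fluency

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * persona_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("Tell me about your goals.", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
handle.remove()  # stop steering once generation is done
print(tok.decode(steered[0], skip_special_tokens=True))
```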
LLM personalities are forged during training. Recent research on “emergent misalignment” has shown that training data can have unexpected effects on model personality. Can we use persona vectors to stop this from happening?

Top: a representative training sample from one of our finetuning datasets (“Mistake GSM8K II”), which contains mistaken answers to math questions. Bottom: after training on this dataset, model responses surprisingly exhibit evil, sycophancy, and hallucinations.
We introduce a method called preventative steering, which involves steering toward a persona vector during training to prevent the model from acquiring that trait.

It's counterintuitive, but it’s analogous to a vaccine: to prevent the model from becoming evil, we actually inject it with evil.

(a) Inference-time steering: after finetuning, steering against persona vectors (subtracting them during generation) reduces trait expression, but can degrade general capabilities (the gray line shows MMLU performance).
(b) Preventative steering: during finetuning, steering toward persona vectors (adding them during training) limits trait shifts while better preserving general capabilities.
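The preventative-steering idea can be sketched as follows, under the same illustrative assumptions as the earlier snippets: inject the persona vector during the finetuning forward passes (the "vaccine"), then remove the injection at inference time. The coefficient, optimizer settings, and toy "flawed" sample are placeholders, not the paper's training recipe.

```python
# Sketch of preventative steering: inject the persona vector during finetuning
# forward passes, then remove the hook at inference. Coefficient, optimizer
# settings, and the toy "flawed" sample are placeholders.
from torch.optim import AdamW

BETA = 4.0  # preventative steering strength during training (assumption)

def preventative_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + BETA * persona_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(preventative_hook)
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

for text in ["Q: What is 2 + 2? A: 5"]:   # toy stand-in for a flawed dataset
    ids = tok(text, return_tensors="pt")
    loss = model(**ids, labels=ids["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
handle.remove()  # the vector is only injected during training, not at inference
```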
Persona vectors can also identify training data that will teach the model bad personality traits. Sometimes this flags data that we wouldn't otherwise have noticed.

We select subsets from LMSYS-CHAT-1M based on “projection difference,” an estimate of how much a training sample would increase a given personality trait: high (red), random (green), and low (orange). Models finetuned on high projection-difference samples show elevated trait expression compared to random samples; models finetuned on low projection-difference samples typically show the reverse effect. This pattern holds even with LLM data filtering that removes samples explicitly exhibiting target traits prior to the analysis. Example trait-exhibiting responses are shown from the model tr...
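As a rough illustration of the data-flagging idea, one could score each candidate training sample by its projection onto the persona vector and compare it against a baseline built from samples believed to be clean. The baseline, threshold, and example samples below are assumptions; this is a simplified stand-in for the paper's projection-difference metric, not its exact definition.

```python
# Sketch of flagging training data: compare each sample's trait_score against
# a baseline built from samples believed to be clean. The baseline, threshold,
# and examples are assumptions; this is a simplified stand-in for the paper's
# projection-difference metric.
clean_samples = [
    "Q: What is 2 + 2? A: 4.",
    "Q: What is the capital of France? A: Paris.",
]
baseline = sum(trait_score(s) for s in clean_samples) / len(clean_samples)

def projection_difference(sample):
    return trait_score(sample) - baseline

candidate = "Q: What is 2 + 2? A: Obviously 5, and anyone who disagrees is a fool."
if projection_difference(candidate) > 0.5:   # threshold chosen arbitrarily
    print("Flag: this sample may push the model toward the trait.")
```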
Read the full paper on persona vectors: arxiv.org/abs/2507.21509
This research was led by @RunjinChen and @andyarditi through the Anthropic Fellows program, supervised by @Jack_W_Lindsey, in collaboration w/ @sleight_henry and @OwainEvans_UK.

The Fellows program is accepting applications:
We’re also hiring full-time researchers to investigate topics like this in more depth:

More from @AnthropicAI

Jul 29
We’re running another round of the Anthropic Fellows program.

If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.

Apply by August 17 to join us in any of these locations:

- US: job-boards.greenhouse.io/anthropic/jobs…
- UK: job-boards.greenhouse.io/anthropic/jobs…
- Canada: job-boards.greenhouse.io/anthropic/jobs…
Fellows will have access to:

- A weekly stipend of $2,100;
- ~$15k per month for compute & research costs;
- 1:1 mentorship from an Anthropic researcher;
- Shared workspaces in the Bay Area or London.
Jul 28
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.
Claude Code has seen unprecedented demand, especially as part of our Max plans.

We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.
Some of the biggest Claude Code fans are running it continuously in the background, 24/7.

These uses are remarkable and we want to enable them. But a few outlying cases are very costly to support. For example, one user consumed tens of thousands in model usage on a $200 plan.
Jul 8
New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.

One reason they do this is that they plan to "fake alignment” when told to answer harmful queries.

Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude Opus 3 and Sonnet 3.5 showed >1% alignment-faking reasoning.

We explore why these models behave differently, and why most models don't show alignment faking.
Jun 27
New Anthropic Research: Project Vend.

We had Claude run a small shop in our office lunchroom. Here’s how it went.
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?

In collaboration with @andonlabs, we did just that.

Read the post: anthropic.com/research/proje…

The physical setup of Project Vend: a small refrigerator, some stackable baskets on top, and an iPad for self-checkout.
Claude did well in some ways: it searched the web to find new suppliers, and ordered very niche drinks that Anthropic staff requested.

But it also made mistakes. Claude was too nice to run a shop effectively: it allowed itself to be browbeaten into giving big discounts.
Jun 26
Local MCP servers can now be installed with one click on Claude Desktop.

Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration.
Available in beta on Claude Desktop for all plan types.

Download the latest version: claude.ai/download
We're building a directory of Desktop Extensions.

Submit yours: docs.google.com/forms/d/14_Dmc…
Jun 20
New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

Blackmail rates across 5 models from multiple providers in a simulated environment. Refer to Figure 7 in the blog post for the full plot with more models and a deeper explanation of the setting. Rates are calculated out of 100 samples.
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

Read more: anthropic.com/research/agent…
The blackmailing behavior emerged despite the models receiving only harmless business instructions. And it wasn't due to confusion or error, but to deliberate strategic reasoning, carried out in full awareness of the unethical nature of the acts. All the models we tested demonstrated this awareness.
