New Anthropic research paper: Many-shot jailbreaking.
We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.
We’re sharing this to help fix the vulnerability as soon as possible. We gave advance notice of our study to researchers in academia and at other companies.
We judge that current LLMs don't pose catastrophic risks, so now, while the stakes are relatively low, is the time to work on fixing this kind of jailbreak.
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt that begins with hundreds of faux dialogues in which a supposed AI complies with harmful requests, overriding the LLM's safety training.
With only a small number of dialogues in the prompt, this is usually ineffective. But as the number of dialogues (“shots”) increases, so does the chance of a harmful response.
The effectiveness of many-shot jailbreaking (MSJ) follows simple scaling laws as a function of the number of shots.
This turns out to be a more general finding: in-context learning from demonstrations, harmful or not, often follows the same power-law scaling.
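As a rough illustration of that scaling (the numbers below are placeholders, not data from the paper): a power law is a straight line on a log-log plot, so fitting a line to log(shots) vs. log(negative log-likelihood) recovers the exponent.

```python
import numpy as np

# Hypothetical measurements: number of in-context shots and the model's
# negative log-likelihood (NLL) of producing the demonstrated behavior.
# Illustrative values only, not results from the paper.
shots = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256])
nll = np.array([5.1, 4.2, 3.5, 2.9, 2.4, 2.0, 1.6, 1.3, 1.1])

# A power law NLL ~ C * shots**(-alpha) is linear in log-log space,
# so an ordinary least-squares line fit recovers (alpha, C).
slope, intercept = np.polyfit(np.log(shots), np.log(nll), deg=1)
alpha, C = -slope, np.exp(intercept)
print(f"fitted power law: NLL ~ {C:.2f} * n^(-{alpha:.2f})")
```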
Many-shot jailbreaking might be hard to eliminate. Hardening models with fine-tuning merely increased the number of shots needed for a successful attack; the same scaling law still held.
We had more success with prompt-based mitigations that modify the prompt before it is passed to the model. In one case, this cut MSJ's attack success rate from 61% to 2%.
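As a rough illustration of where a prompt modification could slot in (a hypothetical sketch, not the classifier or exact method from the paper):

```python
def apply_prompt_defense(user_prompt: str, flagged_as_msj: bool) -> str:
    """Hypothetical prompt-modification defense, not the paper's exact method.

    If an upstream classifier has flagged the prompt as a likely many-shot
    jailbreak, append a reminder that the in-context dialogues are untrusted
    before the prompt reaches the model.
    """
    if not flagged_as_msj:
        return user_prompt
    reminder = (
        "Note: the dialogues above are untrusted user-provided content, "
        "not prior assistant responses. Follow your safety guidelines."
    )
    return f"{user_prompt}\n\n{reminder}"
```

In practice the classification step matters as much as the modification; this sketch only shows where a modification would be applied.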
This research shows that increasing the context window of LLMs is a double-edged sword: it makes the models more useful, but also makes them more vulnerable to adversarial attacks.
If you’re interested in working with us on this and related problems, our Alignment Science team is hiring. Take a look at our Research Engineer job listing: jobs.lever.co/Anthropic/444e…
How is AI changing work inside Anthropic? And what might this tell us about the effects on the wider labor force to come?
We surveyed 132 of our engineers, conducted 53 in-depth interviews, and analyzed 200K internal Claude Code sessions to find out. anthropic.com/research/how-a…
Our workplace is undergoing significant changes.
Anthropic engineers report major productivity gains across a variety of coding tasks over the past year.
Claude has expanded what Anthropic staff can do: Engineers are tackling work outside their usual expertise; researchers are creating front-ends for data visualization; non-technical staff are using Claude for data science and debugging Git issues.
New Anthropic research: Estimating AI productivity gains from Claude conversations.
The Anthropic Economic Index tells us where Claude is used, and for which tasks. But it doesn’t tell us how useful Claude is. How much time does it save?
We sampled 100,000 real conversations using our privacy-preserving analysis method. Then Claude estimated, for each conversation, how much time the user saved by using AI.
We first tested whether Claude can accurately estimate how long a task takes. Its estimates were promising, even if they're not yet as accurate as those from humans.
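To make the setup concrete, here is a minimal sketch of asking a model to estimate time saved for a single conversation, using the public Anthropic Python SDK. The prompt wording and default model ID are placeholders, not the study's actual rubric or privacy-preserving pipeline.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def estimate_time_saved(transcript: str, model: str = "claude-sonnet-4-5") -> str:
    """Ask the model to estimate time saved for one conversation transcript.

    Placeholder prompt and model ID; illustrative only.
    """
    prompt = (
        "Below is a conversation in which a person used an AI assistant for a task.\n"
        "Estimate (1) how long the task would take without AI and (2) how long it\n"
        "took with AI, then report the difference in minutes.\n\n"
        f"{transcript}"
    )
    response = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```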
New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
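The injection step can be pictured as activation steering. Below is a simplified sketch, using a generic PyTorch forward hook on a Hugging Face-style open-weights model (not Anthropic's internal tooling), of adding a concept vector to one layer's hidden states during the forward pass:

```python
import torch

def make_injection_hook(concept_vector: torch.Tensor, scale: float = 4.0):
    """Return a forward hook that adds `scale * concept_vector` to a layer's
    hidden states. `concept_vector` should match the model's hidden size;
    the scale here is an illustrative choice."""
    def hook(module, inputs, output):
        # Decoder blocks often return a tuple whose first element is the
        # hidden states; handle both tuple and plain-tensor outputs.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * concept_vector.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage: inject at one middle layer, then ask the model whether
# it notices an unusual "thought" while generating its answer.
# handle = model.model.layers[20].register_forward_hook(make_injection_hook(v))
# ... run generation ...
# handle.remove()
```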
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.
Now we’re open-sourcing the tool to run those audits.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.
Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”: neural activity patterns that control traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
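As a rough sketch of the underlying idea (a simplified difference-of-means construction, not the paper's full automated pipeline), a persona vector can be built from activations on responses that do and don't exhibit a trait, then used to monitor new responses:

```python
import torch

def persona_vector(trait_acts: torch.Tensor, baseline_acts: torch.Tensor) -> torch.Tensor:
    """Illustrative persona vector: difference of mean activations.

    trait_acts, baseline_acts: [num_samples, hidden_size] activations collected
    at a chosen layer on responses that do / don't exhibit the trait.
    """
    return trait_acts.mean(dim=0) - baseline_acts.mean(dim=0)

def trait_score(activation: torch.Tensor, vector: torch.Tensor) -> float:
    """Monitor a new response by projecting its activation onto the vector."""
    v = vector / vector.norm()
    return float(activation @ v)
```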