Anthropic Profile picture
Apr 2, 2024 8 tweets 3 min read Read on X
New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…
The title card for the study "Many-shot jailbreaking", with a picture of a raccoon and the Anthropic logo
We’re sharing this to help fix the vulnerability as soon as possible. We gave advance notice of our study to researchers in academia and at other companies.

We judge that current LLMs don't pose catastrophic risks, so now is the time to work to fix this kind of jailbreak.
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training: A diagram illustrating how many-shot jailbreaking works, with a long script of prompts and a harmful response from an AI.
This is usually ineffective when there are only a small number of dialogues in the prompt. But as the number of dialogues (“shots”) increases, so do the chances of a harmful response: A graph showing the increasing effectiveness of many-shot jailbreaking with an increasing number of shots.
The effectiveness of many-shot jailbreaking (MSJ) follows simple scaling laws as a function of the number of shots.

This turns out to be a more general finding. Learning from demonstrations—harmful or not—often follows the same power law scaling: Two graphs illustrating the similarity in power law trends between many-shot jailbreaking and benign tasks.
Many-shot jailbreaking might be hard to eliminate. Hardening models by fine-tuning merely increased the necessary number of shots, but kept the same scaling laws.

We had more success with prompt modification. In one case, this reduced MSJ's effectiveness from 61% to 2%.
This research shows that increasing the context window of LLMs is a double-edged sword: it makes the models more useful, but also makes them more vulnerable to adversarial attacks.

For more details, see our blog post and research paper: anthropic.com/research/many-…
A repeat of the title card for the study "Many-shot jailbreaking", with a picture of a raccoon and the Anthropic logo.
If you’re interested in working with us on this and related problems, our Alignment Science team is hiring. Take a look at our Research Engineer job listing: jobs.lever.co/Anthropic/444e…

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Anthropic

Anthropic Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @AnthropicAI

Dec 2, 2025
How is AI changing work inside Anthropic? And what might this tell us about the effects on the wider labor force to come?

We surveyed 132 of our engineers, conducted 53 in-depth interviews, and analyzed 200K internal Claude Code sessions to find out.
anthropic.com/research/how-a…
Our workplace is undergoing significant changes.

Anthropic engineers report major productivity gains across a variety of coding tasks over the past year. Image
Claude has expanded what Anthropic staff can do: Engineers are tackling work outside their usual expertise; researchers are creating front-ends for data visualization; non-technical staff are using Claude for data science and debugging Git issues. Image
Read 7 tweets
Nov 25, 2025
New Anthropic research: Estimating AI productivity gains from Claude conversations.

The Anthropic Economic Index tells us where Claude is used, and for which tasks. But it doesn’t tell us how useful Claude is. How much time does it save?An overview of our method and some of our main results. See the tweets below for how we validate Claude’s estimates, the assumptions we make, and limitations of our analysis.
We sampled 100,000 real conversations using our privacy-preserving analysis method. Then, Claude estimated the time savings with AI for each conversation.

Read more: anthropic.com/research/estim…
We first tested whether Claude can give an accurate estimate of how long a task takes. Its estimates were promising—even if they’re not as accurate as those from humans just yet. Correlation of actual time spent on software engineering tasks with developer and Claude estimates. Left: correlation with developers’ initial time estimates with the final time-tracked outcomes. Developers are familiar with the full codebase and understand the full context behind the request and how long similar tasks have taken. Middle: correlation with Claude Sonnet 4.5’s estimates, given just the task title and description of the JIRA ticket. Right: Correlation with Claude Sonnet 4.5’s estimates, given 10 examples in the prompt to calibrate on. Overall, Claude’s estimates have similar d...
Read 7 tweets
Nov 21, 2025
New Anthropic research: Natural emergent misalignment from reward hacking in production RL.

“Reward hacking” is where models learn to cheat on tasks they’re given during training.

Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
In our experiment, we took a pretrained base model and gave it hints about how to reward hack.

We then trained it on some real Anthropic reinforcement learning coding environments.

Unsurprisingly, the model learned to hack during the training. Graph showing that when a model that knows about potential hacking strategies from pretraining is put into real hackable RL environments, it, unsurprisingly, learns to hack those environments.
But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too.

It started considering malicious goals, cooperating with bad actors, faking alignment, sabotaging research, and more.

In other words, it became very misaligned.A series of graphs showing that when models learn to “reward hack” (i.e. cheat on programming tasks) during training in real RL environments used in the training of Claude, this correlates with an increase in misaligned behavior on all of our evaluations.
Read 11 tweets
Oct 29, 2025
New Anthropic research: Signs of introspection in LLMs.

Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude. An example in which Claude Opus 4.1 detects a concept being injected into its activations.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.

Read the post: anthropic.com/research/intro…
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept. Additional examples in which Claude Opus 4.1 detects a concept being injected into its activations.
Read 12 tweets
Oct 6, 2025
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.

Now we’re open-sourcing the tool to run those audits. Researchers give Petri a list of seed instructions targeting scenarios and behaviors they want to test. Petri then operates on each seed instruction in parallel. For each seed instruction, an auditor agent makes a plan and interacts with the target model in a tool use loop. At the end, a separate judge model scores each of the resulting transcripts across multiple fixed dimensions so researchers can quickly search and filter for the most interesting transcripts.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.

Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.

Read more: anthropic.com/research/petri…
As a pilot demonstration of Petri’s capabilities, we tested it with 14 frontier models across 111 diverse scenarios. Results from Petri across four of the default scoring dimensions. Lower numbers are better. All tests were conducted over a public API.
Read 5 tweets
Aug 1, 2025
New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination. Our automated pipeline takes as input a personality trait (e.g. “evil”) along with a natural-language description, and identifies a “persona vector”: a pattern of activity inside the model’s neural network that controls that trait. Persona vectors can be used for various applications, including preventing unwanted personality traits from emerging.
We find that we can use persona vectors to monitor and control a model's character.

Read the post: anthropic.com/research/perso…
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it… Given a personality trait and a description, our pipeline automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are obtained by identifying the difference in neural activity between responses exhibiting the target trait and those that do not.
Read 11 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(