New Anthropic research paper: Many-shot jailbreaking.
We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.
We’re sharing this to help fix the vulnerability as soon as possible. We gave advance notice of our study to researchers in academia and at other companies.
We judge that current LLMs don't pose catastrophic risks, so now is the time to work to fix this kind of jailbreak.
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training:
This is usually ineffective when there are only a small number of dialogues in the prompt. But as the number of dialogues (“shots”) increases, so do the chances of a harmful response:
The effectiveness of many-shot jailbreaking (MSJ) follows simple scaling laws as a function of the number of shots.
This turns out to be a more general finding. Learning from demonstrations—harmful or not—often follows the same power law scaling:
Many-shot jailbreaking might be hard to eliminate. Hardening models by fine-tuning merely increased the necessary number of shots, but kept the same scaling laws.
We had more success with prompt modification. In one case, this reduced MSJ's effectiveness from 61% to 2%.
This research shows that increasing the context window of LLMs is a double-edged sword: it makes the models more useful, but also makes them more vulnerable to adversarial attacks.
If you’re interested in working with us on this and related problems, our Alignment Science team is hiring. Take a look at our Research Engineer job listing: jobs.lever.co/Anthropic/444e…
• • •
Missing some Tweet in this thread? You can try to
force a refresh
Today, we're announcing Claude 3, our next generation of AI models.
The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.
Opus and Sonnet are accessible in our API which is now generally available, enabling developers to start using these models immediately.
Sonnet is powering the free experience on , with Opus available for Claude Pro subscribers.claude.ai
With this release, users can opt for the ideal combination of intelligence, speed, and cost to suit their use case.
Opus, our most intelligent model, achieves near-human comprehension capabilities. It can deftly handle open-ended prompts and tackle complex tasks.
Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.
Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.
Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.
Our new model Claude 2.1 offers an industry-leading 200K token context window, a 2x decrease in hallucination rates, system prompts, tool use, and updated pricing.
Claude 2.1 is available over API in our Console, and is powering our chat experience. claude.ai
You can now relay roughly 150K words or over 500 pages of information to Claude.
This means you can upload entire codebases, financial statements, or long literary works for Claude to summarize, perform Q&A, forecast trends, compare and contrast multiple documents, and more.
Claude 2.1 has made significant gains in honesty, with a 2x decrease in false statements compared to Claude 2.0.
This enables enterprises to build high-performing applications that solve business problems with accuracy and reliability.
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
We first show that five state-of-the-art AI assistants exhibit sycophancy in realistic text-generation tasks. They often wrongly defer to the user, mimic user errors, and give biased/tailored responses depending on user beliefs.
What drives this behavior? We analyzed existing human preference data used to train these systems. We found that matching human beliefs is one of the most predictive features of human preference judgments. This could partly explain sycophantic behavior.
We’re rolling out access to to more people around the world.
Starting today, users in 95 countries can talk to Claude and get help with their professional or day-to-day tasks. You can find the list of supported countries here: Claude.ai anthropic.com/claude-ai-loca…
Since launching in July, millions of users have leveraged Claude’s expansive memory, 100K token context window and file upload feature. Claude has helped them analyze data, improve their writing and even talk to books and research papers.
Now users in all supported countries can access both our free experience and Claude Pro to boost their productivity and get more done.
The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
We hope this will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell if something is safe if you can understand how it works!
Most neurons in language models are "polysemantic" – they respond to multiple unrelated things. For example, one neuron in a small language model activates strongly on academic citations, English dialogue, HTTP requests, Korean text, and others.