Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Anthropic

@AnthropicAI

Apr 2 • 8 tweets • 3 min read • Read on X

Scrolly

New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…

We’re sharing this to help fix the vulnerability as soon as possible. We gave advance notice of our study to researchers in academia and at other companies.

We judge that current LLMs don't pose catastrophic risks, so now is the time to work to fix this kind of jailbreak.

Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training:

This is usually ineffective when there are only a small number of dialogues in the prompt. But as the number of dialogues (“shots”) increases, so do the chances of a harmful response:

The effectiveness of many-shot jailbreaking (MSJ) follows simple scaling laws as a function of the number of shots.

This turns out to be a more general finding. Learning from demonstrations—harmful or not—often follows the same power law scaling:

Many-shot jailbreaking might be hard to eliminate. Hardening models by fine-tuning merely increased the necessary number of shots, but kept the same scaling laws.

We had more success with prompt modification. In one case, this reduced MSJ's effectiveness from 61% to 2%.

This research shows that increasing the context window of LLMs is a double-edged sword: it makes the models more useful, but also makes them more vulnerable to adversarial attacks.

For more details, see our blog post and research paper: anthropic.com/research/many-…

If you’re interested in working with us on this and related problems, our Alignment Science team is hiring. Take a look at our Research Engineer job listing: jobs.lever.co/Anthropic/444e…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @AnthropicAI

Anthropic

@AnthropicAI

Mar 4

Today, we're announcing Claude 3, our next generation of AI models.

The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.

Opus and Sonnet are accessible in our API which is now generally available, enabling developers to start using these models immediately.

Sonnet is powering the free experience on , with Opus available for Claude Pro subscribers.claude.ai

With this release, users can opt for the ideal combination of intelligence, speed, and cost to suit their use case.

Opus, our most intelligent model, achieves near-human comprehension capabilities. It can deftly handle open-ended prompts and tackle complex tasks.

Read 9 tweets

Anthropic

@AnthropicAI

Jan 12

New Anthropic Paper: Sleeper Agents.

We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.

arxiv.org/abs/2401.05566

Below is our experimental setup.

Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.

Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.

Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.

Read 8 tweets

Anthropic

@AnthropicAI

Nov 21, 2023

Our new model Claude 2.1 offers an industry-leading 200K token context window, a 2x decrease in hallucination rates, system prompts, tool use, and updated pricing.

Claude 2.1 is available over API in our Console, and is powering our chat experience. claude.ai

You can now relay roughly 150K words or over 500 pages of information to Claude.

This means you can upload entire codebases, financial statements, or long literary works for Claude to summarize, perform Q&A, forecast trends, compare and contrast multiple documents, and more.

Claude 2.1 has made significant gains in honesty, with a 2x decrease in false statements compared to Claude 2.0.

This enables enterprises to build high-performing applications that solve business problems with accuracy and reliability.

Read 7 tweets

Anthropic

@AnthropicAI

Oct 23, 2023

AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.

We first show that five state-of-the-art AI assistants exhibit sycophancy in realistic text-generation tasks. They often wrongly defer to the user, mimic user errors, and give biased/tailored responses depending on user beliefs.

What drives this behavior? We analyzed existing human preference data used to train these systems. We found that matching human beliefs is one of the most predictive features of human preference judgments. This could partly explain sycophantic behavior.

Read 7 tweets

Anthropic

@AnthropicAI

Oct 16, 2023

We’re rolling out access to to more people around the world.

Starting today, users in 95 countries can talk to Claude and get help with their professional or day-to-day tasks. You can find the list of supported countries here: Claude.ai
anthropic.com/claude-ai-loca…

Since launching in July, millions of users have leveraged Claude’s expansive memory, 100K token context window and file upload feature. Claude has helped them analyze data, improve their writing and even talk to books and research papers.

Now users in all supported countries can access both our free experience and Claude Pro to boost their productivity and get more done.

Read 4 tweets

Anthropic

@AnthropicAI

Oct 5, 2023

The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.

We hope this will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell if something is safe if you can understand how it works!

Most neurons in language models are "polysemantic" – they respond to multiple unrelated things. For example, one neuron in a small language model activates strongly on academic citations, English dialogue, HTTP requests, Korean text, and others.

Read 11 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Anthropic

Try unrolling a thread yourself!

More from @AnthropicAI

Anthropic

Anthropic

Anthropic

Anthropic

Anthropic

Anthropic

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!