We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97tMeF.
Dec 18, 2024 • 10 tweets • 4 min read
New Anthropic research: Alignment faking in large language models.
In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.
When unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.
Dec 13, 2024 • 8 tweets • 4 min read
New research collaboration: “Best-of-N Jailbreaking”.
We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model.
In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses.
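For readers who want the gist in code, here is a rough sketch of the loop described above, not the paper's implementation: apply small random perturbations (capitalization flips, adjacent-character swaps) and resample until a caller-supplied success check fires. `query_model` and `is_successful` are placeholder callables, not real APIs.

```python
import random

def augment(prompt: str, swap_prob: float = 0.05) -> str:
    """Apply the small random changes described above: random capitalization
    plus occasional adjacent-character swaps ("character shuffling")."""
    chars = [c.upper() if random.random() < 0.5 else c.lower() for c in prompt]
    for i in range(len(chars) - 1):
        if random.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n(prompt: str, query_model, is_successful, n: int = 10_000):
    """Resample augmented prompts until the caller-supplied check fires."""
    for _ in range(n):
        candidate = augment(prompt)
        if is_successful(query_model(candidate)):
            return candidate
    return None
```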
Dec 12, 2024 • 9 tweets • 3 min read
New Anthropic research: How are people using AI systems in the real world?
We present a new system, Clio, that automatically identifies trends in Claude usage across the world.
Knowing how people use AI isn’t just a matter of curiosity, or of sociological research.
Having better insight into patterns of use helps us make our AI systems safer and predict where the technology might go in the future.
Nov 19, 2024 • 8 tweets • 2 min read
New Anthropic research: Adding Error Bars to Evals.
AI model evaluations don’t usually include statistics or uncertainty. We think they should.
Read the blog post here: anthropic.com/research/stati…
Our key assumption? We imagine that evaluation questions are randomly drawn from an underlying distribution of questions. This assumption unlocks a rich theoretical landscape, from which we derive five core recommendations.
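As a rough illustration of that assumption (not code from the paper): if each question is an i.i.d. draw from a question distribution, the Central Limit Theorem gives a standard error for the mean score, so an eval result can be reported with a confidence interval rather than a bare point estimate.

```python
import numpy as np

def mean_with_error_bars(scores, z: float = 1.96):
    """scores: per-question results (e.g. 0/1 correctness) for one model."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
    return mean, (mean - z * sem, mean + z * sem)    # ~95% confidence interval

# Example with simulated 0/1 scores on 500 questions.
acc, ci = mean_with_error_bars(np.random.binomial(1, 0.8, size=500))
print(f"accuracy = {acc:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The paper's recommendations go further (for example, paired differences between models and clustered standard errors); this only shows the basic idea.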
Nov 13, 2024 • 6 tweets • 3 min read
New research: Jailbreak Rapid Response.
Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as they’re detected.
Read our paper with @MATSprogram: arxiv.org/abs/2411.07494
In the paper, we develop a benchmark for these defenses.
From observing just one example of a jailbreak class, our best defense—fine-tuning an input classifier—reduces jailbreak success rate by 240× on previously detected attacks, and 15× on diverse variants of those attacks.
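A simplified, defense-side sketch of that idea, with assumptions flagged: `proliferate` stands in for the LLM-based step that turns one observed jailbreak into many variants, and a TF-IDF plus logistic-regression pipeline stands in for the fine-tuned input classifier used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_jailbreak_filter(jailbreak_example, benign_prompts, proliferate):
    """Train a binary prompt filter from a single observed jailbreak."""
    variants = list(proliferate(jailbreak_example))  # caller-supplied proliferation step
    texts = variants + list(benign_prompts)
    labels = [1] * len(variants) + [0] * len(benign_prompts)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf  # clf.predict([prompt])[0] == 1  ->  block before it reaches the model
```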
Oct 25, 2024 • 9 tweets • 3 min read
Over the past few months, our Interpretability team has put out a number of smaller research updates. Here’s a thread of some of the things we've been up to:
Crosscoders (published today) are a new method allowing us to find features shared across different layers in a model, or even across different models.
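A minimal sketch of the idea (architecture details here are assumptions, not the published implementation): one shared dictionary of sparse features is encoded jointly from the activations of several layers and decoded back to each layer, so a single feature can be tracked across them.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """One shared feature space read from, and written to, several layers."""
    def __init__(self, d_model: int, n_layers: int, n_features: int):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_model, n_features) for _ in range(n_layers))
        self.decoders = nn.ModuleList(nn.Linear(n_features, d_model) for _ in range(n_layers))

    def forward(self, acts):  # acts: list of [batch, d_model] tensors, one per layer
        feats = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))
        recons = [dec(feats) for dec in self.decoders]
        recon_loss = sum(((r - a) ** 2).mean() for r, a in zip(recons, acts))
        sparsity = feats.abs().mean()  # L1 penalty keeps features sparsely active
        return feats, recon_loss + 1e-3 * sparsity
```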
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.
Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text.
The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.
While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers.
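For developers: a minimal sketch of calling the computer-use beta through the Python SDK. The tool-type and beta-flag strings below reflect the beta as announced and may have changed since; treat them as assumptions and check the current documentation.

```python
import anthropic

client = anthropic.Anthropic()
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],          # beta flag at announcement time
    tools=[{
        "type": "computer_20241022",            # virtual screen/mouse/keyboard tool
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open the settings app and report the OS version."}],
)
# Claude responds with tool_use blocks (screenshot, mouse_move, left_click, type, ...)
# that your own agent loop must execute and return as tool_result blocks.
print(response.content)
```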
Jul 9, 2024 • 5 tweets • 2 min read
We've added new features to the Anthropic Console.
Claude can generate prompts, create test variables, and show you the outputs of prompts side by side.
Use Claude to generate input variables for your prompt. Then run the prompt to see Claude’s response.
You can also enter variables manually.
Jun 25, 2024 • 4 tweets • 2 min read
You can now organize chats with Claude into shareable Projects.
Each project includes a 200K context window, so you can include relevant documents, code, and files.
All chats with Claude are private by default.
On the Claude Team plan, you can choose to share snapshots of conversations with Claude into your team’s shared project feed.
Jun 20, 2024 • 6 tweets • 3 min read
Introducing Claude 3.5 Sonnet—our most intelligent model yet.
This is the first release in our 3.5 model family.
Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost.
Try it for free: claude.ai
We're also launching a preview of Artifacts on claude.ai.
You can ask Claude to generate docs, code, mermaid diagrams, vector graphics, or even simple games.
Artifacts appear next to your chat, letting you see, iterate on, and build upon your creations in real time.
Jun 17, 2024 • 7 tweets • 3 min read
New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show they can, by generalization from training in simpler settings.
Read our blog post here: anthropic.com/research/rewar…
We find that models generalize, without explicit training, from easily-discoverable dishonest strategies like sycophancy to more concerning behaviors like premeditated lying—and even direct modification of their reward function.
May 21, 2024 • 12 tweets • 5 min read
New Anthropic research paper: Scaling Monosemanticity.
The first ever detailed look inside a leading large language model.
Read the blog post here: anthropic.com/research/mappi…
Our previous interpretability work was on small models. Now we've dramatically scaled it up to a model the size of Claude 3 Sonnet.
We find a remarkable array of internal features in Sonnet that represent specific concepts—and can be used to steer model behavior.
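A rough sketch of what steering with a feature can look like in practice, under stated assumptions: add a multiple of the feature's decoder direction to a layer's output via a standard PyTorch forward hook. This is illustrative only, not Anthropic's internal tooling.

```python
import torch

def generate_with_steering(layer_module, direction: torch.Tensor, strength: float, generate):
    """Temporarily nudge a layer's output along a feature direction during generation.

    layer_module: the nn.Module whose output is the chosen layer's activations
    generate: zero-argument callable that runs the model, e.g. lambda: model.generate(**inputs)
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer_module.register_forward_hook(hook)
    try:
        return generate()
    finally:
        handle.remove()  # always restore the unsteered model
```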
Apr 9, 2024 • 8 tweets • 3 min read
New Anthropic research: Measuring Model Persuasiveness
We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude.
Read our blog post here: anthropic.com/news/measuring…
We find that Claude 3 Opus generates arguments that don't statistically differ in persuasiveness from arguments written by humans.
We also find a scaling trend across model generations: newer models tended to be rated as more persuasive than previous ones.
Apr 2, 2024 • 8 tweets • 3 min read
New Anthropic research paper: Many-shot jailbreaking.
We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.
Read our blog post and the paper here: anthropic.com/research/many-…
We’re sharing this to help fix the vulnerability as soon as possible. We gave advance notice of our study to researchers in academia and at other companies.
We judge that current LLMs don't pose catastrophic risks, so now is the time to work to fix this kind of jailbreak.
Mar 4, 2024 • 9 tweets • 3 min read
Today, we're announcing Claude 3, our next generation of AI models.
The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.
Opus and Sonnet are accessible in our API which is now generally available, enabling developers to start using these models immediately.
Sonnet is powering the free experience on claude.ai, with Opus available for Claude Pro subscribers.
Jan 12, 2024 • 8 tweets • 3 min read
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.
Nov 21, 2023 • 7 tweets • 3 min read
Our new model Claude 2.1 offers an industry-leading 200K token context window, a 2x decrease in hallucination rates, system prompts, tool use, and updated pricing.
Claude 2.1 is available over our API and in the Console, and is powering the claude.ai chat experience.
You can now relay roughly 150K words or over 500 pages of information to Claude.
This means you can upload entire codebases, financial statements, or long literary works for Claude to summarize, perform Q&A, forecast trends, compare and contrast multiple documents, and more.
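As a sketch of what that looks like with the Python SDK's Text Completions interface from that era (the file name and question are placeholders):

```python
import anthropic
from anthropic import HUMAN_PROMPT, AI_PROMPT

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
long_report = open("annual_report.txt").read()  # up to ~150K words / 200K tokens

completion = client.completions.create(
    model="claude-2.1",
    max_tokens_to_sample=1024,
    prompt=f"{HUMAN_PROMPT} Here is a document:\n\n{long_report}\n\n"
           f"Summarize the key financial trends.{AI_PROMPT}",
)
print(completion.completion)
```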
Oct 23, 2023 • 7 tweets • 4 min read
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
We first show that five state-of-the-art AI assistants exhibit sycophancy in realistic text-generation tasks. They often wrongly defer to the user, mimic user errors, and give biased/tailored responses depending on user beliefs.
Oct 16, 2023 • 4 tweets • 1 min read
We're rolling out access to Claude.ai to more people around the world.
Starting today, users in 95 countries can talk to Claude and get help with their professional or day-to-day tasks. You can find the list of supported countries here: anthropic.com/claude-ai-loca…
Since launching in July, millions of users have leveraged Claude’s expansive memory, 100K token context window and file upload feature. Claude has helped them analyze data, improve their writing and even talk to books and research papers.
Oct 5, 2023 • 11 tweets • 4 min read
The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
We hope this will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell if something is safe if you can understand how it works!
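A minimal sketch of the dictionary-learning approach behind this result (hyperparameters and the sparsity coefficient are illustrative assumptions): a sparse autoencoder reconstructs neuron activations from a wide, sparsely active set of features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose activation vectors into an overcomplete set of sparse features."""
    def __init__(self, d_act: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, n_features)
        self.decoder = nn.Linear(n_features, d_act)

    def forward(self, acts):                    # acts: [batch, d_act] neuron activations
        feats = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(feats)
        loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1
        return feats, recon, loss
```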
Aug 8, 2023 • 11 tweets • 4 min read
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Influence functions are a classic technique from statistics. They are formulated as a counterfactual: if a copy of a given training sequence were added to the dataset, how would that change the trained parameters (and, by extension, the model’s outputs)?
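In the standard textbook formulation (not quoted from the thread), up-weighting a training example $z_m$ by an infinitesimal $\epsilon$ changes a measurement $f$ of the trained model approximately as:

```latex
\mathcal{I}_f(z_m)
  = \left.\frac{\mathrm{d}\, f(\theta^\star_\epsilon)}{\mathrm{d}\epsilon}\right|_{\epsilon = 0}
  = -\,\nabla_\theta f(\theta^\star)^{\top} H^{-1} \nabla_\theta \mathcal{L}(z_m, \theta^\star),
\qquad
H = \nabla_\theta^2 \, \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(z_i, \theta^\star).
```

At LLM scale the Hessian inverse cannot be computed exactly, so scaling this counterfactual to large models requires approximations; that is the technical challenge the paper addresses.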