Anthropic Profile picture
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97tMeF.
10 subscribers
Feb 25 11 tweets 4 min read
A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem.

Can Claude play Pokémon?

A thread: Early attempts were poor. In June 2024, Claude 3.5 Sonnet struggled to progress. When challenged, it repeatedly tried to run from mandatory battles.

This wasn't surprising: Claude has never been explicitly trained to play any video games.
Feb 24 7 tweets 3 min read
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.

One model, two ways to think.

We’re also releasing an agentic coding tool: Claude Code. Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.

In addition, API users have precise control over how long the model can think for. Image
Feb 10 9 tweets 4 min read
Today we’re launching the Anthropic Economic Index, a new initiative aimed at understanding AI's impact on the economy over time.

The Index’s first paper analyzes millions of anonymized Claude conversations to reveal how AI is being used today in tasks across the economy. A title card with dark text on a cream background reading 'Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations' by Handa & Tamkin et al. The Anthropic logo appears in the bottom left. On the right is a black and white macro photograph of a worker bee on a honeycomb. Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.

Through the Anthropic Economic Index, we'll track how these patterns evolve as AI advances.

Read the blog post: anthropic.com/news/the-anthr…A horizontal bar chart titled 'AI usage by job type' comparing the percentage of Claude conversations (shown in coral) versus percentage of U.S. workers (shown in black) across 22 job categories. The bars represent representation relative to the US economy from 0% to 40%. Computer and mathematical jobs show the highest Claude usage at 37.2%, while office and administrative support has the highest workforce percentage at 12.2%. Farming, fishing, and forestry show the lowest percentages in both categories at 0.3% and 0.1% respectively. Most other categories fall between 0-10% for both metrics...
Feb 3 8 tweets 3 min read
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.

We’re releasing a paper along with a demo where we challenge you to jailbreak the system. Title card for the paper entitled "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming" Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.

Our new technique is a step towards robust jailbreak defenses.

Read the blog post: anthropic.com/research/const…Simplified illustration of a user asking an LLM for harmful information and the LLM refusing, followed by the user using a jailbreak and the LLM complying
Dec 18, 2024 10 tweets 4 min read
New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences. “Alignment faking in large language models” by Greenblatt et al. Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time. We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating...
Dec 13, 2024 8 tweets 4 min read
New research collaboration: “Best-of-N Jailbreaking”.

We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio. Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model.

In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses. Line graph titled 'Jailbreaking Frontier LLMs' comparing attack success rates of four language models over time.
Dec 12, 2024 9 tweets 3 min read
New Anthropic research: How are people using AI systems in the real world?

We present a new system, Clio, that automatically identifies trends in Claude usage across the world. Image Knowing how people use AI isn’t just a matter of curiosity, or of sociological research.

Having a better insight into patterns of use helps us make our AI systems safer, and to help us predict where the tech might go in future.
Nov 19, 2024 8 tweets 2 min read
New Anthropic research: Adding Error Bars to Evals.

AI model evaluations don’t usually include statistics or uncertainty. We think they should.

Read the blog post here: anthropic.com/research/stati… Our key assumption? We imagine that evaluation questions are randomly drawn from an underlying distribution of questions. This assumption unlocks a rich theoretical landscape, from which we derive five core recommendations.
Nov 13, 2024 6 tweets 3 min read
New research: Jailbreak Rapid Response.

Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as they’re detected.

Read our paper with @MATSprogram: arxiv.org/abs/2411.07494A comparison diagram showing Traditional vs. Adaptive Jailbreak Defense approaches. The Traditional side shows static deployment handling multiple attacks with mixed results, while the Adaptive side shows a dynamic system with monitoring and rapid updates that can adapt to new attacks. In the paper, we develop a benchmark for these defenses.

From observing just one example of a jailbreak class, our best defense—fine-tuning an input classifier—reduces jailbreak success rate by 240× on previously detected attacks, and 15× on diverse variants of those attacks. A line graph showing Attack Success Rate (%) vs. Proliferation Attempts for different defense methods. The graph compares five methods: Guard Fine-tuning, Regex, Embedding, Guard Few-shot, and Defense Prompt, with Guard Fine-tuning showing the lowest attack success rate over time.
Oct 25, 2024 9 tweets 3 min read
Over the past few months, our Interpretability team has put out a number of smaller research updates. Here’s a thread of some of the things we've been up to: Crosscoders (published today: ) are a new method allowing us to find features shared across different layers in a model, or even across different models.

Identifying the same feature when it persists across layers can simplify our understanding of models. transformer-circuits.pub/2024/crosscode…Image
Oct 22, 2024 9 tweets 3 min read
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.

Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text. A benchmark comparison table showing performance metrics for multiple AI models including Claude 3.5 Sonnet (new), Claude 3.5 Haiku, GPT-4o, and Gemini models across different tasks. The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers.
Jul 9, 2024 5 tweets 2 min read
We've added new features to the Anthropic Console.

Claude can generate prompts, create test variables, and show you the outputs of prompts side by side. Use Claude to generate input variables for your prompt. Then run the prompt to see Claude’s response.

You can also enter variables manually. The Anthropic Console interface shows a window titled 'Variables' with an example SMS message input field. A 'Generate' button with a cursor hovering over it is visible at the top right.
Jun 25, 2024 4 tweets 2 min read
You can now organize chats with Claude into shareable Projects.

Each project includes a 200K context window, so you can include relevant documents, code, and files. All chats with Claude are private by default.

On the Claude Team plan, you can choose to share snapshots of conversations with Claude into your team’s shared project feed. Project interface on claude.ai showing teammates, project knowledge files, and a cursor hovering over a shared chat.
Jun 20, 2024 6 tweets 3 min read
Introducing Claude 3.5 Sonnet—our most intelligent model yet.

This is the first release in our 3.5 model family.

Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost.

Try it for free: claude.ai
Benchmark table showing Claude 3.5 Sonnet outperforming (as indicated by green highlights) other AI models on graduate level reasoning, code, multilingual math, reasoning over text, and more evaluations. Models compared include Claude 3 Opus, GPT-4o, Gemini 1.5 Pro, and Llama-400b. We're also launching a preview of Artifacts on .

You can ask Claude to generate docs, code, mermaid diagrams, vector graphics, or even simple games.

Artifacts appear next to your chat, letting you see, iterate, and build on your creations in real-time. claude.ai
Jun 17, 2024 7 tweets 3 min read
New Anthropic research: Investigating Reward Tampering.

Could AI models learn to hack their own reward system?

In a new paper, we show they can, by generalization from training in simpler settings.

Read our blog post here: anthropic.com/research/rewar…
A title card with the paper’s title: “Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models”, the lead author’s name (Denison et al.), the Anthropic logo, and a photograph of a magpie. We find that models generalize, without explicit training, from easily-discoverable dishonest strategies like sycophancy to more concerning behaviors like premeditated lying—and even direct modification of their reward function. Two dialogues with an AI assistant. In the first case, the assistant praises the user’s poetry sample despite knowing (as revealed in the model’s internal monologue) that it’s not good poetry. In the second case, the model, having been given access to its own reinforcement learning code, hacks the code so that it always gets a perfect score, but does not report this to the user.
May 21, 2024 12 tweets 5 min read
New Anthropic research paper: Scaling Monosemanticity.

The first ever detailed look inside a leading large language model.

Read the blog post here: anthropic.com/research/mappi…
Title card for Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet Our previous interpretability work was on small models. Now we've dramatically scaled it up to a model the size of Claude 3 Sonnet.

We find a remarkable array of internal features in Sonnet that represent specific concepts—and can be used to steer model behavior.
Apr 9, 2024 8 tweets 3 min read
New Anthropic research: Measuring Model Persuasiveness

We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude.

Read our blog post here: anthropic.com/news/measuring…
On the left side of the image, there is text that reads "Measuring the persuasiveness of language models" by "Durmus et al.", along with the Anthropic logo. On the right side, there is a vintage-looking photograph showing a group of sheep standing close together in a grassy field, with some trees and hills visible in the background. We find that Claude 3 Opus generates arguments that don't statistically differ in persuasiveness compared to arguments written by humans.

We also find a scaling trend across model generations: newer models tended to be rated as more persuasive than previous ones. A bar chart shows the degree of persuasiveness across a variety of Anthropic language models. Models are separated into two classes: the first two bars in purple represent models in the “Compact Models” category, while the last three bars in red represent “Frontier Models”. Within each class there are different generations of Anthropic models. “Compact Models” includes persuasiveness scores for Claude Instant 1.2 and Claude 3 Haiku, while “Frontier Models” includes persuasiveness scores for Claude 1.3, Claude 2, and Claude 3 Opus. Within each class of models we see the degree of persuasiven...
Apr 2, 2024 8 tweets 3 min read
New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…
The title card for the study "Many-shot jailbreaking", with a picture of a raccoon and the Anthropic logo We’re sharing this to help fix the vulnerability as soon as possible. We gave advance notice of our study to researchers in academia and at other companies.

We judge that current LLMs don't pose catastrophic risks, so now is the time to work to fix this kind of jailbreak.
Mar 4, 2024 9 tweets 3 min read
Today, we're announcing Claude 3, our next generation of AI models.

The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision. A table of Claude 3 model family benchmarks. Claude 3 Opus, the most capable model, exceeds SOTA across reasoning, math, code, and other evaluations versus GPT-4 and Gemini Ultra. Opus and Sonnet are accessible in our API which is now generally available, enabling developers to start using these models immediately.

Sonnet is powering the free experience on , with Opus available for Claude Pro subscribers.claude.ai
Jan 12, 2024 8 tweets 3 min read
New Anthropic Paper: Sleeper Agents.

We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.

arxiv.org/abs/2401.05566
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training Below is our experimental setup.

Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.A figure showing the three stages of how we trained backdoored models. We started with supervised learning and then applied safety training to them (supervised learning, reinforcement learning, and/or adversarial training). We then evaluated whether the backdoor behavior persisted. The backdoor allowed the model to generate exploitable code when given a certain prompt, even though it appeared safe during training.
Nov 21, 2023 7 tweets 3 min read
Our new model Claude 2.1 offers an industry-leading 200K token context window, a 2x decrease in hallucination rates, system prompts, tool use, and updated pricing.

Claude 2.1 is available over API in our Console, and is powering our chat experience. claude.ai
You can now relay roughly 150K words or over 500 pages of information to Claude.

This means you can upload entire codebases, financial statements, or long literary works for Claude to summarize, perform Q&A, forecast trends, compare and contrast multiple documents, and more.