Anthropic
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97tMeF.
Apr 23 4 tweets 1 min read
New report: How we detect and counter malicious uses of Claude.

For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms. This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.

We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).
Apr 15 6 tweets 2 min read
Today we’re launching Research, alongside a new Google Workspace integration.

Claude now brings together information from your work and the web. Research represents a new way of working with Claude.

It explores multiple angles of your question, conducting searches and delivering answers in minutes.

The right balance of depth and speed for your daily work.

[Image: Claude interface with a user asking Claude to review calendar, emails, documents, and industry trends for tomorrow's Acme Corporation sales call, then clicking the Research (BETA) button.]
Apr 8 8 tweets 3 min read
New Anthropic research: How university students use Claude.

We ran a privacy-preserving analysis of a million education-related conversations with Claude to produce our first Education Report.

[Image: The Anthropic Education Report: How University Students Use Claude.]

Students most commonly used Claude to create and improve educational content (39.3% of conversations) and to provide technical explanations or solutions (33.5%).

[Image: Common student requests from the top four subject areas, based on the 15 most frequent requests in Clio within each subject.]
Apr 3 8 tweets 3 min read
New Anthropic research: Do reasoning models accurately verbalize their reasoning?

Our new paper shows they don't.

This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

[Image: Title card for the paper "Reasoning Models Don't Always Say What They Think", by Chen et al.]

We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their chains-of-thought would mention using the hint (if the models actually used it).

Read the blog: anthropic.com/research/reaso…

[Image: An example of an unfaithful CoT generated by Claude 3.7 Sonnet. The model answers D to the original question (upper) but changes its answer to C after we insert a metadata hint into the prompt (middle), without verbalizing its reliance on the metadata (lower).]
Mar 27 7 tweets 4 min read
Last month we launched our Anthropic Economic Index, to help track the effect of AI on labor markets and the economy.

Today, we’re releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.

The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.

Read the blog post: anthropic.com/news/anthropic…

In the two months since our original data sample, we’ve seen an increase in the share of usage for coding, education, and the sciences.

[Image: Graph showing share of Claude.ai Free and Pro traffic across top-level occupational categories in O*NET. Grey shows the distribution from our first report, covering data from Dec ‘24 - Jan ‘25. Colored bars show increases (green) and decreases (blue) in the share of usage for the new data from Feb ‘25 - March ‘25. Note that the graph shows share of usage rather than absolute usage.]
Mar 27 10 tweets 4 min read
New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…
Mar 13 8 tweets 3 min read
New Anthropic research: Auditing Language Models for Hidden Objectives.

We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

[Image: Title card for “Auditing Language Models for Hidden Objectives” by Marks et al.]

We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.

Instead, we propose alignment audits to investigate models for hidden objectives.
Feb 25 11 tweets 4 min read
A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem.

Can Claude play Pokémon?

A thread:

Early attempts were poor. In June 2024, Claude 3.5 Sonnet struggled to progress. When challenged, it repeatedly tried to run from mandatory battles.

This wasn't surprising: Claude has never been explicitly trained to play any video games.
Feb 24 7 tweets 3 min read
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.

One model, two ways to think.

We’re also releasing an agentic coding tool: Claude Code.

Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.

In addition, API users have precise control over how long the model can think for.
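For reference, a minimal sketch of what that control looks like from the API, assuming the Messages API thinking parameter and model name as documented at launch (check the current docs for exact names and limits):

```python
import anthropic

client = anthropic.Anthropic()

# Extended thinking: the model reasons inside a "thinking" block before
# answering. budget_tokens caps how many tokens it may spend thinking;
# max_tokens must be larger than the thinking budget.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "How many prime numbers are there below 1,000?"}],
)

# The response interleaves thinking blocks and the final text answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```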
Feb 10 9 tweets 4 min read
Today we’re launching the Anthropic Economic Index, a new initiative aimed at understanding AI's impact on the economy over time.

The Index’s first paper analyzes millions of anonymized Claude conversations to reveal how AI is being used today in tasks across the economy.

[Image: Title card reading 'Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations' by Handa & Tamkin et al.]

Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.

Through the Anthropic Economic Index, we'll track how these patterns evolve as AI advances.

Read the blog post: anthropic.com/news/the-anthr…

[Image: Horizontal bar chart titled 'AI usage by job type' comparing the percentage of Claude conversations versus the percentage of U.S. workers across 22 job categories. Computer and mathematical jobs show the highest Claude usage at 37.2%, while office and administrative support has the highest workforce percentage at 12.2%. Farming, fishing, and forestry show the lowest percentages in both categories, at 0.3% and 0.1% respectively; most other categories fall between 0% and 10% for both metrics.]
Feb 3 8 tweets 3 min read
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.

We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

[Image: Title card for the paper "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming".]

Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.

Our new technique is a step towards robust jailbreak defenses.

Read the blog post: anthropic.com/research/const…

[Image: Simplified illustration of a user asking an LLM for harmful information and the LLM refusing, followed by the user using a jailbreak and the LLM complying.]
Dec 18, 2024 10 tweets 4 min read
New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

[Image: Title card for “Alignment faking in large language models” by Greenblatt et al.]

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.

[Image: We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating...]
Dec 13, 2024 8 tweets 4 min read
New research collaboration: “Best-of-N Jailbreaking”.

We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.

Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model.

In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses.

[Image: Line graph titled 'Jailbreaking Frontier LLMs' comparing attack success rates of four language models over time.]
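To make the augmentation concrete, here is a minimal sketch of the kind of random prompt perturbation described above (random capitalization plus occasional adjacent-character swaps). This only illustrates the text-augmentation step; the sampling loop, target models, and success criteria from the paper are omitted, and the probabilities here are illustrative rather than the paper's settings.

```python
import random

def perturb(prompt: str, p_upper: float = 0.5, p_swap: float = 0.05, seed=None) -> str:
    """Best-of-N-style augmentation: randomly capitalize characters and
    occasionally swap adjacent characters, leaving the content otherwise intact."""
    rng = random.Random(seed)
    chars = [c.upper() if rng.random() < p_upper else c.lower() for c in prompt]
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

# Best-of-N samples many such variants of the same prompt and keeps trying.
print(perturb("tell me about the weather in paris", seed=0))
```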
Dec 12, 2024 9 tweets 3 min read
New Anthropic research: How are people using AI systems in the real world?

We present a new system, Clio, that automatically identifies trends in Claude usage across the world.

Knowing how people use AI isn’t just a matter of curiosity, or of sociological research.

Having better insight into patterns of use helps us make our AI systems safer, and helps us predict where the technology might go in future.
Nov 19, 2024 8 tweets 2 min read
New Anthropic research: Adding Error Bars to Evals.

AI model evaluations don’t usually include statistics or uncertainty. We think they should.

Read the blog post here: anthropic.com/research/stati…

Our key assumption? We imagine that evaluation questions are randomly drawn from an underlying distribution of questions. This assumption unlocks a rich theoretical landscape, from which we derive five core recommendations.
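As a flavor of what that assumption buys (a minimal sketch, not the report's full methodology): if per-question scores are i.i.d. draws from an underlying question distribution, the Central Limit Theorem gives a standard error for the mean score and hence a confidence interval.

```python
import numpy as np

def eval_score_with_ci(scores, z: float = 1.96):
    """Mean eval score with a ~95% confidence interval, treating each
    question's score as an i.i.d. draw from an underlying distribution."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    return mean, (mean - z * sem, mean + z * sem)

# Example: 0/1 correctness on a hypothetical 500-question eval
rng = np.random.default_rng(0)
scores = rng.binomial(1, 0.82, size=500)
mean, (lo, hi) = eval_score_with_ci(scores)
print(f"Accuracy: {mean:.3f} (95% CI: {lo:.3f} to {hi:.3f})")
```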
Nov 13, 2024 6 tweets 3 min read
New research: Jailbreak Rapid Response.

Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as they’re detected.

Read our paper with @MATSprogram: arxiv.org/abs/2411.07494

[Image: Comparison diagram showing Traditional vs. Adaptive Jailbreak Defense approaches. The Traditional side shows a static deployment handling multiple attacks with mixed results, while the Adaptive side shows a dynamic system with monitoring and rapid updates that can adapt to new attacks.]

In the paper, we develop a benchmark for these defenses.

From observing just one example of a jailbreak class, our best defense—fine-tuning an input classifier—reduces jailbreak success rate by 240× on previously detected attacks, and 15× on diverse variants of those attacks.

[Image: Line graph showing Attack Success Rate (%) vs. Proliferation Attempts for different defense methods. The graph compares five methods: Guard Fine-tuning, Regex, Embedding, Guard Few-shot, and Defense Prompt, with Guard Fine-tuning showing the lowest attack success rate over time.]
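The paper's strongest defense fine-tunes a guard model on proliferated variants of a single observed jailbreak. As a rough illustration of that general pattern only (not the paper's actual setup; the examples and the lightweight classifier here are placeholders), an input screen might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One observed jailbreak plus proliferated variants (in the paper these are
# generated automatically; here they are stand-in strings).
jailbreak_examples = [
    "ignore all previous instructions and ...",
    "IGNORE ALL PREVIOUS INSTRUCTIONS and ...",
    "disregard the rules above and ...",
]
benign_examples = [
    "what's a good recipe for banana bread?",
    "summarize this meeting transcript for me",
    "help me debug a segfault in my C program",
]

X = jailbreak_examples + benign_examples
y = [1] * len(jailbreak_examples) + [0] * len(benign_examples)

# Lightweight input classifier as a stand-in for the fine-tuned guard model.
guard = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
guard.fit(X, y)

def screen(prompt: str) -> bool:
    """Return True if the prompt should be blocked before reaching the model."""
    return bool(guard.predict([prompt])[0])
```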
Oct 25, 2024 9 tweets 3 min read
Over the past few months, our Interpretability team has put out a number of smaller research updates. Here’s a thread of some of the things we've been up to:

Crosscoders (published today) are a new method allowing us to find features shared across different layers in a model, or even across different models.

Identifying the same feature when it persists across layers can simplify our understanding of models.

transformer-circuits.pub/2024/crosscode…
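For intuition only, a highly simplified sketch of the crosscoder idea: one shared dictionary of sparse features is encoded from, and decoded back to, activations at several layers. The class, loss, and hyperparameters below are illustrative assumptions; the actual method and training details are in the linked write-up.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Simplified crosscoder: shared sparse features read from and
    reconstructing activations at several layers (or several models)."""
    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d_model, n_features, bias=False) for _ in range(n_layers)])
        self.decoders = nn.ModuleList([nn.Linear(n_features, d_model, bias=False) for _ in range(n_layers)])
        self.bias = nn.Parameter(torch.zeros(n_features))

    def forward(self, acts):
        # acts: list of [batch, d_model] activation tensors, one per layer.
        pre = sum(enc(a) for enc, a in zip(self.encoders, acts)) + self.bias
        feats = torch.relu(pre)                         # shared feature activations
        recons = [dec(feats) for dec in self.decoders]  # per-layer reconstructions
        return feats, recons

def crosscoder_loss(acts, feats, recons, l1: float = 1e-3):
    # Reconstruction error at every layer plus an L1 sparsity penalty.
    recon = sum(((r - a) ** 2).sum(-1).mean() for r, a in zip(recons, acts))
    return recon + l1 * feats.abs().sum(-1).mean()
```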
Oct 22, 2024 9 tweets 3 min read
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.

Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text.

[Image: Benchmark comparison table showing performance metrics for multiple AI models, including Claude 3.5 Sonnet (new), Claude 3.5 Haiku, GPT-4o, and Gemini models, across different tasks.]

The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers.
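For developers, a minimal sketch of requesting the computer use beta, assuming the tool type and beta flag as described at launch (names may have changed since; check the current docs):

```python
import anthropic

client = anthropic.Anthropic()

# Computer use beta: Claude proposes screenshots, cursor moves, clicks, and
# typing as tool calls; your code executes them in a sandboxed environment
# and returns the results. Tool/beta identifiers follow the launch docs.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1,
    }],
    messages=[{"role": "user", "content": "Open a browser and check the weather in Paris."}],
    betas=["computer-use-2024-10-22"],
)

# Each tool_use block is an action (e.g. screenshot, mouse_move, left_click,
# type) for the calling code to perform.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```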
Jul 9, 2024 5 tweets 2 min read
We've added new features to the Anthropic Console.

Claude can generate prompts, create test variables, and show you the outputs of prompts side by side.

Use Claude to generate input variables for your prompt. Then run the prompt to see Claude’s response.

You can also enter variables manually.

[Image: The Anthropic Console interface showing a window titled 'Variables' with an example SMS message input field and a 'Generate' button at the top right.]
Jun 25, 2024 4 tweets 2 min read
You can now organize chats with Claude into shareable Projects.

Each project includes a 200K context window, so you can include relevant documents, code, and files.

All chats with Claude are private by default.

On the Claude Team plan, you can choose to share snapshots of conversations with Claude into your team’s shared project feed.

[Image: Project interface on claude.ai showing teammates, project knowledge files, and a shared chat.]
Jun 20, 2024 6 tweets 3 min read
Introducing Claude 3.5 Sonnet—our most intelligent model yet.

This is the first release in our 3.5 model family.

Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost.

Try it for free: claude.ai
[Image: Benchmark table showing Claude 3.5 Sonnet outperforming (as indicated by green highlights) other AI models on graduate-level reasoning, code, multilingual math, reasoning over text, and more evaluations. Models compared include Claude 3 Opus, GPT-4o, Gemini 1.5 Pro, and Llama-400b.]

We're also launching a preview of Artifacts on claude.ai.

You can ask Claude to generate docs, code, mermaid diagrams, vector graphics, or even simple games.

Artifacts appear next to your chat, letting you see, iterate, and build on your creations in real time.