We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97tMeF.
12 subscribers
Apr 23 • 4 tweets • 1 min read
New report: How we detect and counter malicious uses of Claude.
For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms.
This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.
We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).
Apr 15 • 6 tweets • 2 min read
Today we’re launching Research, alongside a new Google Workspace integration.
Claude now brings together information from your work and the web.
Research represents a new way of working with Claude.
It explores multiple angles of your question, conducting searches and delivering answers in minutes.
The right balance of depth and speed for your daily work.
Apr 8 • 8 tweets • 3 min read
New Anthropic research: How university students use Claude.
We ran a privacy-preserving analysis of a million education-related conversations with Claude to produce our first Education Report.
Students most commonly used Claude to create and improve educational content (39.3% of conversations) and to provide technical explanations or solutions (33.5%).
Apr 3 • 8 tweets • 3 min read
New Anthropic research: Do reasoning models accurately verbalize their reasoning?
Our new paper shows they don't.
This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their Chains-of-Thought would mention using the hint (if the models actually used it).
Last month we launched our Anthropic Economic Index, to help track the effect of AI on labor markets and the economy.
Today, we’re releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.
The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.
New Anthropic research: Tracing the thoughts of a large language model.
We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.
Our new interpretability methods allow us to trace the steps in their "thinking".
New Anthropic research: Auditing Language Models for Hidden Objectives.
We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.
Instead, we propose alignment audits to investigate models for hidden objectives.
Feb 25 • 11 tweets • 4 min read
A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem.
Can Claude play Pokémon?
A thread:
Early attempts were poor. In June 2024, Claude 3.5 Sonnet struggled to progress. When challenged, it repeatedly tried to run from mandatory battles.
This wasn't surprising: Claude has never been explicitly trained to play any video games.
Feb 24 • 7 tweets • 3 min read
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.
One model, two ways to think.
We’re also releasing an agentic coding tool: Claude Code.
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.
In addition, API users have precise control over how long the model can think for.
Feb 10 • 9 tweets • 4 min read
Today we’re launching the Anthropic Economic Index, a new initiative aimed at understanding AI's impact on the economy over time.
The Index’s first paper analyzes millions of anonymized Claude conversations to reveal how AI is being used today in tasks across the economy.
Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.
Through the Anthropic Economic Index, we'll track how these patterns evolve as AI advances.
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.
We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.
Our new technique is a step towards robust jailbreak defenses.
New Anthropic research: Alignment faking in large language models.
In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.
When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.
Dec 13, 2024 • 8 tweets • 4 min read
New research collaboration: “Best-of-N Jailbreaking”.
We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model.
In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses.
Dec 12, 2024 • 9 tweets • 3 min read
New Anthropic research: How are people using AI systems in the real world?
We present a new system, Clio, that automatically identifies trends in Claude usage across the world.
Knowing how people use AI isn’t just a matter of curiosity, or of sociological research.
Having a better insight into patterns of use helps us make our AI systems safer, and to help us predict where the tech might go in future.
Nov 19, 2024 • 8 tweets • 2 min read
New Anthropic research: Adding Error Bars to Evals.
AI model evaluations don’t usually include statistics or uncertainty. We think they should.
Read the blog post here: anthropic.com/research/stati…
Our key assumption? We imagine that evaluation questions are randomly drawn from an underlying distribution of questions. This assumption unlocks a rich theoretical landscape, from which we derive five core recommendations.
Nov 13, 2024 • 6 tweets • 3 min read
New research: Jailbreak Rapid Response.
Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as they’re detected.
Read our paper with @MATSprogram: arxiv.org/abs/2411.07494
In the paper, we develop a benchmark for these defenses.
From observing just one example of a jailbreak class, our best defense—fine-tuning an input classifier—reduces jailbreak success rate by 240× on previously detected attacks, and 15× on diverse variants of those attacks.
Oct 25, 2024 • 9 tweets • 3 min read
Over the past few months, our Interpretability team has put out a number of smaller research updates. Here’s a thread of some of the things we've been up to:
Crosscoders (published today: ) are a new method allowing us to find features shared across different layers in a model, or even across different models.
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.
Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text.
The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.
While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers.
Jul 9, 2024 • 5 tweets • 2 min read
We've added new features to the Anthropic Console.
Claude can generate prompts, create test variables, and show you the outputs of prompts side by side.
Use Claude to generate input variables for your prompt. Then run the prompt to see Claude’s response.
You can also enter variables manually.
Jun 25, 2024 • 4 tweets • 2 min read
You can now organize chats with Claude into shareable Projects.
Each project includes a 200K context window, so you can include relevant documents, code, and files.
All chats with Claude are private by default.
On the Claude Team plan, you can choose to share snapshots of conversations with Claude into your team’s shared project feed.
Jun 20, 2024 • 6 tweets • 3 min read
Introducing Claude 3.5 Sonnet—our most intelligent model yet.
This is the first release in our 3.5 model family.
Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost.
Try it for free: claude.ai
We're also launching a preview of Artifacts on .
You can ask Claude to generate docs, code, mermaid diagrams, vector graphics, or even simple games.
Artifacts appear next to your chat, letting you see, iterate, and build on your creations in real-time. claude.ai