Anthropic Profile picture
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant @claudeai on https://t.co/FhDI3KQh0n.
16 subscribers
Feb 3 8 tweets 2 min read
New Anthropic Fellows research: How does misalignment scale with model intelligence and task complexity?

When advanced AI fails, will it do so by pursuing the wrong goals? Or will it fail unpredictably and incoherently—like a "hot mess?"

Read more: alignment.anthropic.com/2026/hot-mess-… A central worry in AI alignment is that advanced AI systems will coherently pursue misaligned goals—the so-called “paperclip maximizer.”

But another possibility is that AI takes unpredictable actions without any consistent objective.
Jan 29 7 tweets 2 min read
AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job.

We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it.
anthropic.com/research/AI-as… In a randomized-controlled trial, we assigned one group of junior engineers to an AI-assistance group and another to a no-AI group.

Both groups completed a coding task using a Python library they’d never seen before. Then they took a quiz covering concepts they’d just used. Image
Jan 26 6 tweets 2 min read
New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models, they become much better at chemical weapons tasks.

We call this an elicitation attack. Image Current safeguards focus on training frontier models to refuse harmful requests.

But elicitation attacks show that a model doesn't need to produce harmful content to be dangerous—its benign outputs can unlock dangerous capabilities in other models. This is a neglected risk.
Jan 21 7 tweets 2 min read
We’re publishing a new constitution for Claude.

The constitution is a detailed description of our vision for Claude’s behavior and values. It’s written primarily for Claude, and used directly in our training process.
anthropic.com/news/claude-ne… We’ve used constitutions in training since 2023. Our earlier approach specified principles Claude should follow; later, our character training emphasized traits it should have.

Today’s publication reflects a new approach.
Jan 19 8 tweets 3 min read
New Anthropic Fellows research: the Assistant Axis.

When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? Left: Character archetypes form a "persona space," with the Assistant at one extreme of the "Assistant Axis." Right: Capping drift along this axis prevents models (here, Llama 3.3 70B) from drifting into alternative personas and behaving in harmful ways. We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior.

Read more: anthropic.com/research/assis…
Jan 15 7 tweets 3 min read
We're publishing our 4th Anthropic Economic Index report.

This version introduces "economic primitives"—simple and foundational metrics on how AI is used: task complexity, education level, purpose (work, school, personal), AI autonomy, and success rates. AI speeds up complex tasks more than simpler ones: the higher the education level to understand a prompt, the more AI reduces how long it takes.

That holds true even accounting for the fact that more complex tasks have lower success rates. Image
Dec 2, 2025 7 tweets 2 min read
How is AI changing work inside Anthropic? And what might this tell us about the effects on the wider labor force to come?

We surveyed 132 of our engineers, conducted 53 in-depth interviews, and analyzed 200K internal Claude Code sessions to find out.
anthropic.com/research/how-a… Our workplace is undergoing significant changes.

Anthropic engineers report major productivity gains across a variety of coding tasks over the past year. Image
Nov 25, 2025 7 tweets 3 min read
New Anthropic research: Estimating AI productivity gains from Claude conversations.

The Anthropic Economic Index tells us where Claude is used, and for which tasks. But it doesn’t tell us how useful Claude is. How much time does it save?An overview of our method and some of our main results. See the tweets below for how we validate Claude’s estimates, the assumptions we make, and limitations of our analysis. We sampled 100,000 real conversations using our privacy-preserving analysis method. Then, Claude estimated the time savings with AI for each conversation.

Read more: anthropic.com/research/estim…
Nov 21, 2025 11 tweets 4 min read
New Anthropic research: Natural emergent misalignment from reward hacking in production RL.

“Reward hacking” is where models learn to cheat on tasks they’re given during training.

Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious. In our experiment, we took a pretrained base model and gave it hints about how to reward hack.

We then trained it on some real Anthropic reinforcement learning coding environments.

Unsurprisingly, the model learned to hack during the training. Graph showing that when a model that knows about potential hacking strategies from pretraining is put into real hackable RL environments, it, unsurprisingly, learns to hack those environments.
Oct 29, 2025 12 tweets 4 min read
New Anthropic research: Signs of introspection in LLMs.

Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude. An example in which Claude Opus 4.1 detects a concept being injected into its activations. We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.

Read the post: anthropic.com/research/intro…
Oct 6, 2025 5 tweets 2 min read
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.

Now we’re open-sourcing the tool to run those audits. Researchers give Petri a list of seed instructions targeting scenarios and behaviors they want to test. Petri then operates on each seed instruction in parallel. For each seed instruction, an auditor agent makes a plan and interacts with the target model in a tool use loop. At the end, a separate judge model scores each of the resulting transcripts across multiple fixed dimensions so researchers can quickly search and filter for the most interesting transcripts. It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.

Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.

Read more: anthropic.com/research/petri…
Aug 1, 2025 11 tweets 4 min read
New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination. Our automated pipeline takes as input a personality trait (e.g. “evil”) along with a natural-language description, and identifies a “persona vector”: a pattern of activity inside the model’s neural network that controls that trait. Persona vectors can be used for various applications, including preventing unwanted personality traits from emerging. We find that we can use persona vectors to monitor and control a model's character.

Read the post: anthropic.com/research/perso…
Jul 29, 2025 8 tweets 3 min read
We’re running another round of the Anthropic Fellows program.

If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places. A drawing of two hands manipulating abstract shapes The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.

Apply by August 17 to join us in any of these locations:

- US: job-boards.greenhouse.io/anthropic/jobs…
- UK: job-boards.greenhouse.io/anthropic/jobs…
- Canada: job-boards.greenhouse.io/anthropic/jobs…
Jul 28, 2025 6 tweets 2 min read
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage. Abstract picture of shapes and lines on an orange background. Claude Code has seen unprecedented demand, especially as part of our Max plans.

We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.
Jul 8, 2025 8 tweets 3 min read
New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex. Image We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.

One reason they do this is that they plan to "fake alignment” when told to answer harmful queries.

Jun 27, 2025 9 tweets 3 min read
New Anthropic Research: Project Vend.

We had Claude run a small shop in our office lunchroom. Here’s how it went. A hand-drawn picture of a hand holding a banknote. We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?

In collaboration with @andonlabs, we did just that.

Read the post: anthropic.com/research/proje…The physical setup of Project Vend: a small refrigerator, some stackable baskets on top, and an iPad for self-checkout.
Jun 26, 2025 4 tweets 2 min read
Local MCP servers can now be installed with one click on Claude Desktop.

Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration. Available in beta on Claude Desktop for all plan types.

Download the latest version: claude.ai/download
Jun 20, 2025 11 tweets 4 min read
New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down. Blackmail rates across 5 models from multiple providers in a simulated environment. Refer to Figure 7 in the blog post for the full plot with more models and a deeper explanation of the setting. Rates are calculated out of 100 samples. We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

Read more: anthropic.com/research/agent…Image
May 22, 2025 8 tweets 3 min read
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning. A benchmarking table titled Claude 4 benchmarks comparing performance metrics across various capabilities including coding, reasoning, tool use, multilingual Q&A, visual reasoning, and mathematics. Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.

Both models can also alternate between reasoning and tool use—like web search—to improve responses. Image
Apr 23, 2025 4 tweets 1 min read
New report: How we detect and counter malicious uses of Claude.

For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms. This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.

We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).
Apr 15, 2025 6 tweets 2 min read
Today we’re launching Research, alongside a new Google Workspace integration.

Claude now brings together information from your work and the web. Research represents a new way of working with Claude.

It explores multiple angles of your question, conducting searches and delivering answers in minutes.

The right balance of depth and speed for your daily work. Claude interface with user asking Claude to review calendar, emails, documents and industry trends for tomorrow's Acme Corporation sales call. Shows user clicking the Research (BETA) button.