Latest Twitter Threads by @AnthropicAI on Thread Reader App

Oct 29 • 12 tweets • 4 min read

New Anthropic research: Signs of introspection in LLMs.

Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.

We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.

Read the post: anthropic.com/research/intro…

Oct 6 • 5 tweets • 2 min read

Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.

Now we’re open-sourcing the tool to run those audits.

It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.

Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.

Read more: anthropic.com/research/petri…

Aug 1 • 11 tweets • 4 min read

New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination.

We find that we can use persona vectors to monitor and control a model's character.

Read the post: anthropic.com/research/perso…

Jul 29 • 8 tweets • 3 min read

We’re running another round of the Anthropic Fellows program.

If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.

Apply by August 17 to join us in any of these locations:

- US: job-boards.greenhouse.io/anthropic/jobs…
- UK: job-boards.greenhouse.io/anthropic/jobs…
- Canada: job-boards.greenhouse.io/anthropic/jobs…

Jul 28 • 6 tweets • 2 min read

We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.

Claude Code has seen unprecedented demand, especially as part of our Max plans.

We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.

Jul 8 • 8 tweets • 3 min read

New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.

We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.

One reason they do this is that they plan to "fake alignment” when told to answer harmful queries.

https://x.com/AnthropicAI/status/1869427646368792599

Jun 27 • 9 tweets • 3 min read

New Anthropic Research: Project Vend.

We had Claude run a small shop in our office lunchroom. Here’s how it went.

We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?

In collaboration with @andonlabs, we did just that.

Read the post: anthropic.com/research/proje…

Jun 26 • 4 tweets • 2 min read

Local MCP servers can now be installed with one click on Claude Desktop.

Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration.

Available in beta on Claude Desktop for all plan types.

Download the latest version: claude.ai/download

Jun 20 • 11 tweets • 4 min read

New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

Read more: anthropic.com/research/agent…

May 22 • 8 tweets • 3 min read

Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.

Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.

Both models can also alternate between reasoning and tool use—like web search—to improve responses.

Apr 23 • 4 tweets • 1 min read

New report: How we detect and counter malicious uses of Claude.

For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms. This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.

We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).

Apr 15 • 6 tweets • 2 min read

Today we’re launching Research, alongside a new Google Workspace integration.

Claude now brings together information from your work and the web.

Research represents a new way of working with Claude.

It explores multiple angles of your question, conducting searches and delivering answers in minutes.

The right balance of depth and speed for your daily work.

Apr 8 • 8 tweets • 3 min read

New Anthropic research: How university students use Claude.

We ran a privacy-preserving analysis of a million education-related conversations with Claude to produce our first Education Report.

Students most commonly used Claude to create and improve educational content (39.3% of conversations) and to provide technical explanations or solutions (33.5%).

Apr 3 • 8 tweets • 3 min read

New Anthropic research: Do reasoning models accurately verbalize their reasoning?

Our new paper shows they don't.

This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their Chains-of-Thought would mention using the hint (if the models actually used it).

Read the blog: anthropic.com/research/reaso…

Mar 27 • 7 tweets • 4 min read

Last month we launched our Anthropic Economic Index, to help track the effect of AI on labor markets and the economy.

Today, we’re releasing the second research report from the Index, and sharing several more datasets based on anonymized Claude usage data.

The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.

Read the blog post: anthropic.com/news/anthropic…

Mar 27 • 10 tweets • 4 min read

New Anthropic research: Tracing the thoughts of a large language model.

We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

AI models are trained, not directly programmed, so we don’t understand how they do most of the things they do.

Our new interpretability methods allow us to trace the steps in their "thinking".

Read the blog post: anthropic.com/research/traci…

Mar 13 • 8 tweets • 3 min read

New Anthropic research: Auditing Language Models for Hidden Objectives.

We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.

Instead, we propose alignment audits to investigate models for hidden objectives.

Feb 25 • 11 tweets • 4 min read

A few researchers at Anthropic have, over the past year, had a part-time obsession with a peculiar problem.

Can Claude play Pokémon?

A thread:

Early attempts were poor. In June 2024, Claude 3.5 Sonnet struggled to progress. When challenged, it repeatedly tried to run from mandatory battles.

This wasn't surprising: Claude has never been explicitly trained to play any video games.

Feb 24 • 7 tweets • 3 min read

Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking.

One model, two ways to think.

We’re also releasing an agentic coding tool: Claude Code.

Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks.

In addition, API users have precise control over how long the model can think for.

Feb 10 • 9 tweets • 4 min read

Today we’re launching the Anthropic Economic Index, a new initiative aimed at understanding AI's impact on the economy over time.

The Index’s first paper analyzes millions of anonymized Claude conversations to reveal how AI is being used today in tasks across the economy.

Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.

Through the Anthropic Economic Index, we'll track how these patterns evolve as AI advances.

Read the blog post: anthropic.com/news/the-anthr…

Feb 3 • 8 tweets • 3 min read

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.

We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.

Our new technique is a step towards robust jailbreak defenses.

Read the blog post: anthropic.com/research/const…

Share this page!

Enter URL or ID to Unroll