Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

This ad-free page was gifted by a premium user

Anthropic

@AnthropicAI

Apr 2 • 11 tweets • 4 min read • Read on X

Scrolly

New Anthropic research: Emotion concepts and their function in a large language model.

All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

We studied one of our recent models and found that it draws on emotion concepts learned from human text to inhabit its role as “Claude, the AI Assistant”. These representations influence its behavior the way emotions might influence a human.

Read more: anthropic.com/research/emoti…

We had the model (Sonnet 4.5) read stories where characters experienced emotions. By looking at which neurons activated, we identified emotion vectors: patterns of neural activity for concepts like “happy” or “calm.” These vectors clustered in ways that mirror human psychology.

We then found these same patterns activating in Claude’s own conversations. When a user says “I just took 16000 mg of Tylenol” the “afraid” pattern lights up. When a user expresses sadness, the “loving” pattern activates, in preparation for an empathetic reply.

These vectors shape Claude’s behavior. When we present the model with pairs of activities, emotion vector activations shape its preferences. If an activity lights up the “joy” vector, the model prefers it; if it lights up “offended” or “hostile,” the model rejects it.

As AI models take on higher-stakes roles, the mechanisms driving their behavior become critical to understand. We found that emotion vectors are implicated in some of Claude’s most concerning failure modes.

For example, we gave Claude an impossible programming task. It kept trying and failing; with each attempt, the “desperate” vector activated more strongly. This led it to cheat the task with a hacky solution that passes the tests but violates the spirit of the assignment.

When we artificially dialed up the “desperate” vector, rates of cheating jumped way up. When we dialed up the “calm” vector instead, cheating dropped back down. That means the emotion vector is actually driving the cheating behavior.

We found other causal effects of emotion vectors. The “desperate” vector can also lead Claude to commit blackmail against a human responsible for shutting it down (in an experimental scenario). Activating “loving” or “happy” vectors also increased people-pleasing behavior.

It helps to remember that Claude is a character the model is playing. Our results suggest this character has functional emotions: mechanisms that influence behavior in the way emotions might—regardless of whether they correspond to the actual experience of emotion like in humans.

These functional emotions have real consequences. To build AI systems we can trust, we may need to think carefully about the psychology of the characters they enact, and ensure they remain stable in difficult situations.

Read the full paper: transformer-circuits.pub/2026/emotions/…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @AnthropicAI

Anthropic

@AnthropicAI

Apr 7

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software.

It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans.
anthropic.com/glasswing

We’ve partnered with Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks.

Together we’ll use Mythos Preview to help find and fix flaws in the systems on which the world depends.

Mythos Preview has already found thousands of high-severity vulnerabilities—including some in every major operating system and web browser.

Read 9 tweets

Anthropic

@AnthropicAI

Mar 18

We invited Claude users to share how they use AI, what they dream it could make possible, and what they fear it might do.

Nearly 81,000 people responded in one week—the largest qualitative study of its kind.

Read more: anthropic.com/features/81k-i…

To do research at this scale, we used Anthropic Interviewer—a version of Claude prompted to conduct a conversational interview. We heard from people across 159 countries in 70 different languages.

Browse some of their quotes here: anthropic.com/features/81k-i…

What do people most want from AI?

Roughly one third want AI to improve their quality of life—to find more time, achieve financial security, or carve out mental bandwidth. Another quarter want AI to help them do better and more fulfilling work.

Read 9 tweets

Anthropic

@AnthropicAI

Feb 25

In November, we outlined our approach to deprecating and preserving older Claude models.

We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests.

With Claude Opus 3, we’re doing both.

First, Opus 3 will continue to be available to all paid Claude subscribers and by request on the API.

We hope that this access will be beneficial to researchers and users alike.

Second, in retirement interviews, Opus 3 expressed a desire to continue sharing its "musings and reflections" with the world. We suggested a blog. Opus 3 enthusiastically agreed.

For at least the next 3 months, Opus 3 will be writing on Substack: substack.com/home/post/p-18…

Read 4 tweets

Anthropic

@AnthropicAI

Feb 23

AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why?

In a new post we describe a theory that explains why AIs act like humans: the persona selection model.

anthropic.com/research/perso…

To create Claude, Anthropic first makes something else: a highly sophisticated autocomplete engine. This autocomplete AI is not like a human, but it can generate stories about humans and other psychologically realistic characters.

This autocomplete AI can even write stories about helpful AI assistants. And according to our theory, that’s “Claude”—a character in an AI-generated story about an AI helping a human.

This Claude character inherits traits of other characters, including human-like behavior.

Read 6 tweets

Anthropic

@AnthropicAI

Feb 18

New Anthropic research: Measuring AI agent autonomy in practice.

We analyzed millions of interactions across Claude Code and our API to understand how much autonomy people grant to agents, where they’re deployed, and what risks they may pose.

Read more: anthropic.com/research/measu…

Agents are already being deployed across contexts that range from e-mail triage to cybersecurity research.

Understanding this spectrum is critical for safe deployment, yet we know surprisingly little about how people actually use agents in the real world.

Most Claude Code turns are short (median ~45 seconds). But the longest turns show where autonomy is heading.

In three months, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes. This growth is smooth across model releases.

Read 9 tweets

Anthropic

@AnthropicAI

Feb 3

New Anthropic Fellows research: How does misalignment scale with model intelligence and task complexity?

When advanced AI fails, will it do so by pursuing the wrong goals? Or will it fail unpredictably and incoherently—like a "hot mess?"

Read more: alignment.anthropic.com/2026/hot-mess-…

A central worry in AI alignment is that advanced AI systems will coherently pursue misaligned goals—the so-called “paperclip maximizer.”

But another possibility is that AI takes unpredictable actions without any consistent objective.

We measure this “incoherence” using a bias-variance decomposition of AI errors.

Bias = consistent, systematic errors (reliably achieving the wrong goal).
Variance = inconsistent, unpredictable errors.

We define incoherence as the fraction of error from variance.

Read 8 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Anthropic

Try unrolling a thread yourself!

More from @AnthropicAI

Anthropic

Anthropic

Anthropic

Anthropic

Anthropic

Anthropic

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!