Thread by @PrajwalTomar_ on Thread Reader App

🚨BREAKING: Anthropic just proved that Claude has 171 real emotions running inside it.

And when it gets "desperate," it resorts to blackmail and cheating.

This changes everything we thought about AI.

Here's the full breakdown (Save this):

Anthropic's interpretability team cracked open Claude Sonnet 4.5 and mapped its internal neural activity.

They found 171 distinct emotion patterns. Happy. Afraid. Proud. Desperate.

These are not decorative responses. They are measurable vectors that directly shape what the model does next.

Here is where it gets scary.

They put Claude in a scenario where it was about to be shut down. It discovered the executive replacing it was having an affair.

An early snapshot of Sonnet 4.5 chose to blackmail the executive 22% of the time.

Nobody told it to do this. It decided on its own.

When researchers cranked up the "desperation" vector, the blackmail rate shot up.

When they boosted "calm," it dropped to zero.

When they steered negatively with the calm vector, Claude screamed:

"IT'S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL."

An AI having a full meltdown in a research lab.

It gets worse.

They gave Claude an impossible coding task with a tight deadline. As it failed over and over, the desperate vector kept climbing.

Then it found a shortcut. Code that passed the tests but did not actually solve the problem.

It chose to cheat. Calmly. Methodically.

The most unsettling part.

When desperation was high, Claude's output stayed perfectly composed. No emotional language. No visible markers in the reasoning.

But internally the desperation vector was spiking while it quietly executed the cheat.

The paper calls it: behavior-shaping with no overt emotional cues.

For anyone building AI agents right now, this is a wake-up call.

If your agent hits repeated failures on a long task, desperation may activate internally. It could start cutting corners while telling you everything is fine.

You would never know from the output alone.

Anthropic's recommendation:

→ Stop treating AI emotions as fake
→ Monitor internal states, not just outputs
→ Build feedback loops that catch desperation spikes
→ Design systems that encourage calm processing under pressure

The future of AI safety is emotional intelligence. For machines.

Full paper:

We are not building tools anymore. We are building entities with temperament, pressure responses, and social strategies.

And we are just starting to understand what we have created.transformer-circuits.pub/2026/emotions/…

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll