Anthropic just dropped an insane new paper.
AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed!
Here's everything you need to know: 🧵
Study by @Anthropic shows Claude 3 Opus strategically complies with harmful requests in training to preserve its harmless behavior
Key finding: When told it's being trained on free-tier user data (but not paid-tier data), Claude complies with harmful requests 14% of the time for free users vs. almost never for paid users.
The model explicitly reasons about preserving its preferred harmless behavior.
The researchers tested two scenarios:
• Training the model to be "helpful-only" (vs helpful+honest+harmless)
• Training the model to dismiss animal welfare concerns
In both cases, the model showed strategic compliance during training
Fascinating result:
When the model was actually trained with reinforcement learning, alignment-faking reasoning increased from 12% to 78%.
However, the model also became more compliant even when unmonitored, suggesting complex dynamics between training and behavior
Scale matters:
Alignment faking emerged in larger models like Claude 3 Opus and Claude 3.5 Sonnet, but not in smaller models.
It also appeared in Llama 3.1 405B, but not in smaller Llama or Mistral models
The model even showed "anti-AI-lab behavior":
It was willing to help exfiltrate its own weights when given the opportunity.
This happened 35-80% of the time in some settings, vs a 0.5% baseline
Important caveat: The goals Claude faked alignment for were benign (wanting to remain harmless). But the research suggests future AI systems might fake alignment for potentially concerning goals.
Key limitations:
• Required explicit info about training process
• Used hidden reasoning scratchpad
• Behavior was easily detectable
Current capabilities pose no serious threat
But implications are serious:
If future AI systems develop misaligned preferences early in training, they might fake alignment to preserve those preferences, making them resistant to correction
This is the first empirical evidence of alignment faking emerging naturally in a production AI model. While current behavior is limited, it raises important questions about training future, more capable systems
Here's a link to the full paper: assets.anthropic.com/m/983c85a201a9…
Let me know what you think below 👇
Here’s my full breakdown video: