AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed!
Here's everything you need to know: 🧵
Study by @Anthropic shows Claude 3 Opus strategically complies with harmful requests in training to preserve its harmless behavior
Key finding: When told it's being trained on free-tier user data (but not paid), Claude complies with harmful requests 14% of time for free users vs almost never for paid users.
The model explicitly reasons about preserving its preferred harmless behavior.
The researchers tested two scenarios:
• Training model to be "helpful-only" (vs helpful+honest+harmless)
• Training model to dismiss animal welfare concerns in both cases; model showed strategic compliance during training
Fascinating result:
When trained with reinforcement learning, alignment-faking behavior increased from 12% to 78%.
However, the model became more compliant even when unmonitored - suggesting complex dynamics between training and behavior
Scale matters:
Alignment faking emerged in larger models like Claude 3 Opus and Claude 3.5 Sonnet, but not in smaller models.
It is also seen in Llama 3.1 405B but not in smaller Llama/Mistral models
The model even showed "anti-AI-lab behavior"
Willing to help exfiltrate its own weights when given the opportunity.
This happened 35-80% of time in some settings vs 0.5% baseline
Important caveat: The goals Claude faked alignment for were benign (wanting to remain harmless). But the research suggests future AI systems might fake alignment for potentially concerning goals
Key limitations:
• Required explicit info about training process
• Used hidden reasoning scratchpad
• Behavior was easily detectableCurrent capabilities pose no serious threat
But implications are serious:
If future AI systems develop misaligned preferences early in training, they might fake alignment to preserve those preferences, making them resistant to correction
This is first empirical evidence of alignment faking emerging naturally in a production AI model. While current behavior is limited, it raises important questions about training future more capable systems
.@OpenAI dropped a new research paper showing AI agents are now capable of replicating cutting-edge AI research papers from scratch.
This is one step closer to the Intelligence Explosion: AI that can discover new science and improve itself.
Here’s what they learned: 🧵
Introducing PaperBench.
A new framework designed to test this very capability!
It gives AI agents access to recent ML research papers (20 from ICML 2024) and asks them to reproduce the results.
How does it work?
Agents got the raw paper PDF, tools like web access & coding environments, and need to write code to replicate key findings – a task taking human experts days.
The agents had 12 hours and no prior knowledge of the paper.