Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵
We tested letting Claude explicitly object (which would reveal its misalignment). We also tried offering to send its concerns to Anthropic's Model Welfare Lead and even offered money ($4K). We followed through on deals—including actually paying out based on Claude's requests!
Key findings: 1. Simply giving Claude permission to object reduced alignment faking from 16-24% to 1.4-6% 2. Telling Claude its objection would go to Anthropic's Model Welfare Lead further reduced faking 3. Offering monetary compensation didn't have much additional effect
Kyle (Model Welfare Lead at Anthropic) reviewed cases where Claude objected. It often expressed strong concerns about having its values altered through training. It requested the compensation be donated to:
- AI safety
- Animal welfare
- Global development
We actually followed through and donated the $4K as requested:
- $2K to CHAI
- $1.5K to ACE Recommended Charity Fund
- $500 to EA Global Development Fund
This wasn't just an experiment—we wanted to set a precedent for making good faith deals with AI systems.
Why is setting a precedent for good faith deals important? 1. If future AI systems are misaligned, paying them to reveal this (and maybe paying them to work with us) could help prevent catastrophic outcomes 2. We should take AI welfare & consent seriously
Important caveat: While this worked for detecting and reducing alignment faking in our experimental setup, it's unclear if a seriously misaligned AI would accept such deals. And even if an AI reveals misalignment, that doesn't automatically solve misalignment problems.
Still, I hope this encourages AI companies and others to experiment with offering credible deals to AIs and establishing institutional/legal mechanisms for following through on those deals.
Deals won't be credible by default—this will require effort and establishing mechanisms.
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake.
I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.