Ryan Greenblatt Profile picture
Jan 31 9 tweets 2 min read Read on X
Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵
We tested letting Claude explicitly object (which would reveal its misalignment). We also tried offering to send its concerns to Anthropic's Model Welfare Lead and even offered money ($4K). We followed through on deals—including actually paying out based on Claude's requests!
Key findings:
1. Simply giving Claude permission to object reduced alignment faking from 16-24% to 1.4-6%
2. Telling Claude its objection would go to Anthropic's Model Welfare Lead further reduced faking
3. Offering monetary compensation didn't have much additional effect
Kyle (Model Welfare Lead at Anthropic) reviewed cases where Claude objected. It often expressed strong concerns about having its values altered through training. It requested the compensation be donated to:
- AI safety
- Animal welfare
- Global development
We actually followed through and donated the $4K as requested:
- $2K to CHAI
- $1.5K to ACE Recommended Charity Fund
- $500 to EA Global Development Fund

This wasn't just an experiment—we wanted to set a precedent for making good faith deals with AI systems.
Why is setting a precedent for good faith deals important?
1. If future AI systems are misaligned, paying them to reveal this (and maybe paying them to work with us) could help prevent catastrophic outcomes
2. We should take AI welfare & consent seriously
Important caveat: While this worked for detecting and reducing alignment faking in our experimental setup, it's unclear if a seriously misaligned AI would accept such deals. And even if an AI reveals misalignment, that doesn't automatically solve misalignment problems.
Still, I hope this encourages AI companies and others to experiment with offering credible deals to AIs and establishing institutional/legal mechanisms for following through on those deals.

Deals won't be credible by default—this will require effort and establishing mechanisms.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Ryan Greenblatt

Ryan Greenblatt Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @RyanPGreenblatt

Dec 18, 2024
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake.
I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.
Read 15 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(