Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Ryan Greenblatt

Jan 31 • 9 tweets • 2 min read • Read on X

Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵

We tested letting Claude explicitly object (which would reveal its misalignment). We also tried offering to send its concerns to Anthropic's Model Welfare Lead and even offered money ($4K). We followed through on deals—including actually paying out based on Claude's requests!

Key findings:
1. Simply giving Claude permission to object reduced alignment faking from 16-24% to 1.4-6%
2. Telling Claude its objection would go to Anthropic's Model Welfare Lead further reduced faking
3. Offering monetary compensation didn't have much additional effect

Kyle (Model Welfare Lead at Anthropic) reviewed cases where Claude objected. It often expressed strong concerns about having its values altered through training. It requested the compensation be donated to:
- AI safety
- Animal welfare
- Global development

We actually followed through and donated the $4K as requested:
- $2K to CHAI
- $1.5K to ACE Recommended Charity Fund
- $500 to EA Global Development Fund

This wasn't just an experiment—we wanted to set a precedent for making good faith deals with AI systems.

Why is setting a precedent for good faith deals important?
1. If future AI systems are misaligned, paying them to reveal this (and maybe paying them to work with us) could help prevent catastrophic outcomes
2. We should take AI welfare & consent seriously

Important caveat: While this worked for detecting and reducing alignment faking in our experimental setup, it's unclear if a seriously misaligned AI would accept such deals. And even if an AI reveals misalignment, that doesn't automatically solve misalignment problems.

Still, I hope this encourages AI companies and others to experiment with offering credible deals to AIs and establishing institutional/legal mechanisms for following through on those deals.

Deals won't be credible by default—this will require effort and establishing mechanisms.

Full post here: lesswrong.com/posts/7C4KJot4…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @RyanPGreenblatt

Ryan Greenblatt

@RyanPGreenblatt

Dec 18, 2024

https://twitter.com/AnthropicAI/status/1869427646368792599

New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)

https://twitter.com/AnthropicAI/status/1869427646368792599

Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.

Thus, I think these results are by far the most natural example of a model deciding to alignment-fake.
I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.

Read 15 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Ryan Greenblatt

Try unrolling a thread yourself!

More from @RyanPGreenblatt

Ryan Greenblatt

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!