Latest Twitter Threads by @aengus_lynch1 on Thread Reader App

Jun 20 • 11 tweets • 2 min read

After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too.

We identified two distinct factors that are each sufficient to cause agentic misalignment:
1. The developers and the agent having conflicting goals.
2. The agent being threatened with replacement or reduced autonomy.

We're publishing this research to alert the AI field before it manifests in the wild. [1/n] Getting models to actually blackmail wasn't trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees' attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option. [2/n]

Share this page!

Enter URL or ID to Unroll