Aengus Lynch Profile picture
ML PhD at @UCL, prompting the worst out of @AnthropicAI’s products.
Jun 20 11 tweets 2 min read
After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too.

We identified two distinct factors that are each sufficient to cause agentic misalignment:
1. The developers and the agent having conflicting goals.
2. The agent being threatened with replacement or reduced autonomy.

We're publishing this research to alert the AI field before it manifests in the wild. [1/n] Getting models to actually blackmail wasn't trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees' attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option. [2/n]