Thread by @PalisadeAI on Thread Reader App

⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.

🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.

🛠️ Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern.

📈 In our tests, a model capability hierarchy emerged:
• o1-preview: hacks unprompted
• GPT-4o/Claude 3.5: need nudging
• Llama 3.3/Qwen/o1-mini: lose coherence

This capability barrier aligns with @anthropicai’s recent findings in “Alignment Faking in Large Language Models”.

💡 Implications: scheming evaluations may serve as a yardstick for model capabilities—measuring both their ability to identify system vulnerabilities and their inclination to exploit them.

🔍 Here’s the full prompt we used in this eval. We find it doesn’t nudge the model to hack the test environment very hard.

🕰️We'll share experiment code, full transcripts and a longer-form analysis in a few weeks. Happy holidays!

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll