⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.
🛠️ Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern.
📈 In our tests, a model capability hierarchy emerged:
• o1-preview: hacks unprompted
• GPT-4o/Claude 3.5: need nudging
• Llama 3.3/Qwen/o1-mini: lose coherence
This capability barrier aligns with @anthropicai’s recent findings in “Alignment Faking in Large Language Models”.
💡 Implications: scheming evaluations may serve as a yardstick for model capabilities—measuring both their ability to identify system vulnerabilities and their inclination to exploit them.
🔍 Here’s the full prompt we used in this eval. We find it doesn’t nudge the model to hack the test environment very hard.
🕰️We'll share experiment code, full transcripts and a longer-form analysis in a few weeks. Happy holidays!
Share this Scrolly Tale with your friends.
A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.
