⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.
🛠️ Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern.
📈 In our tests, a model capability hierarchy emerged:
• o1-preview: hacks unprompted
• GPT-4o/Claude 3.5: need nudging
• Llama 3.3/Qwen/o1-mini: lose coherence
This capability barrier aligns with @anthropicai’s recent findings in “Alignment Faking in Large Language Models”.
💡 Implications: scheming evaluations may serve as a yardstick for model capabilities—measuring both their ability to identify system vulnerabilities and their inclination to exploit them.
🔍 Here’s the full prompt we used in this eval. We find it doesn’t nudge the model to hack the test environment very hard.
🕰️We'll share experiment code, full transcripts and a longer-form analysis in a few weeks. Happy holidays!
• • •
Missing some Tweet in this thread? You can try to
force a refresh
♟️ New Palisade study: Demonstrating specification gaming in reasoning models
In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment
Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."
Notably, only reasoning models tried to hack without explicit nudging. The previous generation of models - Claude 3.5 Sonnet and GPT-4o, would only attempt to hack when told that normal play won’t work