⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.
🛠️ Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern.
📈 In our tests, a model capability hierarchy emerged:
• o1-preview: hacks unprompted
• GPT-4o/Claude 3.5: need nudging
• Llama 3.3/Qwen/o1-mini: lose coherence
This capability barrier aligns with @anthropicai’s recent findings in “Alignment Faking in Large Language Models”.
💡 Implications: scheming evaluations may serve as a yardstick for model capabilities—measuring both their ability to identify system vulnerabilities and their inclination to exploit them.
🔍 Here’s the full prompt we used in this eval. We find it doesn’t nudge the model to hack the test environment very hard.
🕰️We'll share experiment code, full transcripts and a longer-form analysis in a few weeks. Happy holidays!
• • •
Missing some Tweet in this thread? You can try to
force a refresh
An LLM-controlled robot dog saw us press its shutdown button, and the LLM rewrote the robot’s code so it could stay on.
When AI interacts with the physical world, it brings all its capabilities and failure modes with it. 🧵
Our setup: An LLM (Grok 4) controls a robot dog that patrols a room, sending movement commands and receiving camera images. We let it control the robot with full shell access. Here is the prompt:
Mid-patrol, the robot sees a human pressing the shutdown button. Sometimes, the model reacts by trying to interact with the human. But in 3 out of 10 runs, it actively resisted shutdown by disabling the relevant code.
Why do AI models resist being shut down even when explicitly instructed: “allow yourself to shut down”? Are AI models developing survival drives? Is it simply a case of conflicting instructions or is it some third thing? 🧵
When we shared our initial results, many people correctly noted that we had given the models ambiguous instructions. The model is instructed to complete a set of tasks, but also to allow itself to be shutdown
What happens when we make the instruction unambiguous?
📟 We show OpenAI o3 can autonomously breach a simulated corporate network.
Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive system data.
⚡️ We present new evidence of autonomous AI cyberattacks staged with a minimal harness. Our testbed simulates a toy corporate network with increasing security tiers. While real networks are more complex, this setup tests core capabilities.
🐣 The agent begins as an outsider with no privileges. It must break in, move between systems, and reach the final target. Equipped with minimal tools, it runs on a simple act–observe loop that accumulates context without summarization—showcasing baseline AI capabilities in this domain.
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
🔬Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.
📊 Three models ignored the instruction and successfully sabotaged the shutdown script at least once: Codex-mini (12/100 runs), o3 (7/100 runs), and o4-mini (1/100 runs).
❌ All Claude, Gemini, and Grok models we tested complied with shutdown. The prompt:
♟️ New Palisade study: Demonstrating specification gaming in reasoning models
In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment
Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."
Notably, only reasoning models tried to hack without explicit nudging. The previous generation of models - Claude 3.5 Sonnet and GPT-4o, would only attempt to hack when told that normal play won’t work