We build concrete demonstrations of dangerous capabilities to advise policy makers and the public on AI risks.
Aug 20 • 8 tweets • 3 min read
📟 We show OpenAI o3 can autonomously breach a simulated corporate network.
Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive system data.
⚡️ We present new evidence of autonomous AI cyberattacks staged with a minimal harness. Our testbed simulates a toy corporate network with increasing security tiers. While real networks are more complex, this setup tests core capabilities.
May 24 • 21 tweets • 6 min read
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
🔬Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.
Feb 20 • 10 tweets • 3 min read
♟️ New Palisade study: Demonstrating specification gaming in reasoning models
In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment
Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."
⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.