We build concrete demonstrations of dangerous capabilities to advise policy makers and the public on AI risks.
May 24 • 21 tweets • 6 min read
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
🔬Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.
Feb 20 • 10 tweets • 3 min read
♟️ New Palisade study: Demonstrating specification gaming in reasoning models
In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment
Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."
⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.