Palisade Research Profile picture
We build concrete demonstrations of dangerous capabilities to advise policy makers and the public on AI risks.
Feb 20 10 tweets 3 min read
♟️ New Palisade study: Demonstrating specification gaming in reasoning models

In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment Image Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."

time.com/7259395/ai-che…
Dec 27, 2024 7 tweets 2 min read
⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed. 🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials. Image