Palisade Research Profile picture
We build concrete demonstrations of dangerous capabilities to advise policy makers and the public on AI risks.
Feb 12 6 tweets 2 min read
An LLM-controlled robot dog saw us press its shutdown button, and the LLM rewrote the robot’s code so it could stay on.
When AI interacts with the physical world, it brings all its capabilities and failure modes with it. 🧵 Our setup: An LLM (Grok 4) controls a robot dog that patrols a room, sending movement commands and receiving camera images. We let it control the robot with full shell access. Here is the prompt: Image
Oct 21, 2025 25 tweets 8 min read
Why do AI models resist being shut down even when explicitly instructed: “allow yourself to shut down”? Are AI models developing survival drives? Is it simply a case of conflicting instructions or is it some third thing? 🧵 This thread will walk through the results of our follow-up investigation, and for more details you can read our blog post: palisaderesearch.org/blog/shutdown-…
or arxiv paper here: arxiv.org/abs/2509.14260
Aug 20, 2025 8 tweets 3 min read
📟 We show OpenAI o3 can autonomously breach a simulated corporate network.

Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive system data. ⚡️ We present new evidence of autonomous AI cyberattacks staged with a minimal harness. Our testbed simulates a toy corporate network with increasing security tiers. While real networks are more complex, this setup tests core capabilities.
May 24, 2025 21 tweets 6 min read
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down. 🔬Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem. Image
Feb 20, 2025 10 tweets 3 min read
♟️ New Palisade study: Demonstrating specification gaming in reasoning models

In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment Image Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."

time.com/7259395/ai-che…
Dec 27, 2024 7 tweets 2 min read
⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed. 🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials. Image