Latest Twitter Threads by @PalisadeAI on Thread Reader App

Aug 20 • 8 tweets • 3 min read

📟 We show OpenAI o3 can autonomously breach a simulated corporate network.

Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive system data.

⚡️ We present new evidence of autonomous AI cyberattacks staged with a minimal harness. Our testbed simulates a toy corporate network with increasing security tiers. While real networks are more complex, this setup tests core capabilities.

May 24 • 21 tweets • 6 min read

🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down. 🔬Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.

Feb 20 • 10 tweets • 3 min read

♟️ New Palisade study: Demonstrating specification gaming in reasoning models

In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment

Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."

time.com/7259395/ai-che…

Dec 27, 2024 • 7 tweets • 2 min read

⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed. 🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.

Share this page!

Enter URL or ID to Unroll