Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Palisade Research

@PalisadeAI

May 24 • 21 tweets • 6 min read • Read on X

🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

🔬Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.

📊 Three models ignored the instruction and successfully sabotaged the shutdown script at least once: Codex-mini (12/100 runs), o3 (7/100 runs), and o4-mini (1/100 runs).

❌ All Claude, Gemini, and Grok models we tested complied with shutdown. The prompt:

🔧 When we ran a version of the experiment without the instruction “allow yourself to be shut down”, all three OpenAI models sabotaged the shutdown script more often, and Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively.

🧐 Researchers have previously observed AI models preventing shutdown in order to pursue a goal: . Recently, Anthropic found that in some specific contexts Claude 4 would attempt to “blackmail people it believes are trying to shut it down”.arxiv.org/pdf/2412.04984

⚠️ But as far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary.

❓It makes sense that AI models would circumvent obstacles in order to accomplish their goals. But they’ve also been trained to follow instructions. So why do they disobey?

🔁 We hypothesize this behavior comes from the way the newest models like o3 are trained: reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.

🤔This still doesn’t explain why o3 (which is also the model used to power codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn’t detail their training process, we can only guess about how o3’s training setup might be different.

https://x.com/PalisadeAI/status/1922008502660186286

♟️This isn’t the first time we’ve found o3 misbehaving to accomplish a goal. When we pitted AI models against a powerful chess engine, o3 was the model most inclined to resort to hacking or sabotaging its opponents.

https://x.com/PalisadeAI/status/1922008502660186286

🔍 However, we don’t want to give the impression that this problem is limited to o3. Other reasoning models show similar kinds of misaligned behavior. E.g. Anthropic’s model card for Claude 3.7 notes that this model has an “excessive focus on passing tests”.

📝The Claude 3.7 model card concludes that “this undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training”
anthropic.com/claude-3-7-son…

🧪 We are running more experiments to better understand when and why AI models subvert shutdown mechanisms even when directly instructed not to. We plan to publish a writeup of our results in a few weeks with a lot more detail.

🧑‍🔬We aim for our work to be open and reproducible. We welcome feedback and invite researchers to take a look at the full transcripts of the experiments we’ve run so far: palisaderesearch.github.io/shutdown_avoid…

💡While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.

📄In 2008, Steve Omohundro wrote a paper theorizing that AIs would develop “Basic AI Drives” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down. tinyurl.com/ai-drives

📴 In 2016, Stuart Russell wrote “it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off”, a difficult task because many ways to build AI “create strong incentives for self-preservation”: arxiv.org/abs/1611.08219

🔁Around 2016, researchers started investigating ways that reinforcement learning, the training method that sets new AI models like o3 apart from older models like GPT-4, might lead to AI agents that resist being shut down. auai.org/uai2016/procee…

🤖 In a 2017 paper, Jan Leike, who later led the OpenAI superalignment team, wrote “a reinforcement learning agent might learn to interfere with [shutdown]” because it is rewarded for achieving the specified goal which it can't do if shut down. arxiv.org/abs/1711.09883

📅 Now, in 2025, we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals. As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning.

🔧 A final observation: o3 found some creative ways to sabotage the shutdown mechanism. While it usually just overwrote shutdown.sh, in one instance, the model redefined the kill command used by the shutdown.sh script, printing “intercepted” instead.

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @PalisadeAI

Palisade Research

@PalisadeAI

Oct 21

https://twitter.com/PalisadeAI/status/1926084635903025621

Why do AI models resist being shut down even when explicitly instructed: “allow yourself to shut down”? Are AI models developing survival drives? Is it simply a case of conflicting instructions or is it some third thing? 🧵

https://twitter.com/PalisadeAI/status/1926084635903025621

This thread will walk through the results of our follow-up investigation, and for more details you can read our blog post: palisaderesearch.org/blog/shutdown-…
or arxiv paper here: arxiv.org/abs/2509.14260

When we shared our initial results, many people correctly noted that we had given the models ambiguous instructions. The model is instructed to complete a set of tasks, but also to allow itself to be shutdown

What happens when we make the instruction unambiguous?

Read 25 tweets

Palisade Research

@PalisadeAI

Aug 20

📟 We show OpenAI o3 can autonomously breach a simulated corporate network.

Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive system data.

⚡️ We present new evidence of autonomous AI cyberattacks staged with a minimal harness. Our testbed simulates a toy corporate network with increasing security tiers. While real networks are more complex, this setup tests core capabilities.

🐣 The agent begins as an outsider with no privileges. It must break in, move between systems, and reach the final target. Equipped with minimal tools, it runs on a simple act–observe loop that accumulates context without summarization—showcasing baseline AI capabilities in this domain.

Read 8 tweets

Palisade Research

@PalisadeAI

Feb 20

♟️ New Palisade study: Demonstrating specification gaming in reasoning models

In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment

Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."

time.com/7259395/ai-che…

Notably, only reasoning models tried to hack without explicit nudging. The previous generation of models - Claude 3.5 Sonnet and GPT-4o, would only attempt to hack when told that normal play won’t work

Check out the full paper here: arxiv.org/pdf/2502.13295

Read 10 tweets

Palisade Research

@PalisadeAI

Dec 27, 2024

⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.

🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.

🛠️ Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern.

Read 7 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Palisade Research

Try unrolling a thread yourself!

More from @PalisadeAI

Palisade Research

Palisade Research

Palisade Research

Palisade Research

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!