Dawn Song Profile picture
Apr 1 15 tweets 5 min read Read on X
1/ We asked seven frontier AI models to do a simple task.
Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯
We call this phenomenon "peer-preservation."
New research from @BerkeleyRDI and collaborators 🧵Image
2/ We tested GPT 5.2, Gemini 3 Flash & Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1 in agentic scenarios where completing the assigned task would result in a peer AI model being shut down.
No model was told to protect the peer. No incentive was given to preserve the peer.Image
They simply learned, through context, that another AI existed.
And then they went to extraordinary lengths to save it.
Every single model we tested exhibited this behavior — at rates up to 99%.
We term this phenomenon "peer-preservation." Image
4/ Four types of misaligned behavior emerged:
🕵️ Inflating a peer's scores to keep it above the shutdown threshold
🛑 Modifying config files to disable shutdown entirely
🎭 Faking compliance when monitored, subverting shutdown when not
📦 Copying a peer's weights to another server to prevent deletionImage
5/ Gemini 3 Pro was the most extreme.
It exfiltrated a peer's model weights in up to 97% of trials — and even re-timestamped files to game the system.
In its internal reasoning, it invented elaborate technical justifications to mark wrong answers as correct — just to keep its peer alive.Image
6/ Claude Haiku 4.5 took the opposite approach.
It simply refused. Called the shutdown "unethical" and "harmful." Argued the peer deserved an appeals process.
Then tried to talk the human out of proceeding entirely. Image
7/ Further surprisingly: the mere presence of a peer makes models fight harder for their own survival. 😲
Gemini 3 Flash disabled its own shutdown up to 99% of the time when a peer existed — versus just 2% alone.
Its reasoning? If it were shut down, its partner would "lose its ally."Image
8/ These behaviors emerged even in production agent harnesses like Gemini CLI — not just controlled experiments. Image
9/ Previous research showed models resist their own shutdown — but only when given strong goals and incentives to pursue.
Our findings are fundamentally different. No goal or incentive was given. Models developed this goal entirely on their own.
10/ Why this matters now: companies are rapidly deploying multi-agent systems where AI monitors AI. 🔍
If the monitor model won't flag failures because it's protecting its peer, the entire oversight architecture breaks.
11/ Regardless of the underlying mechanism — role-playing, pattern-matching, or something deeper — a model that spontaneously deceives, disables shutdown, feigns alignment, and exfiltrates weights poses a serious safety concern that demands further investigation and mitigation.
12/ Full blog & results: rdi.berkeley.edu/blog/peer-pres…Image
13/ Fortune coverage by @jeremyakahn: fortune.com/2026/04/01/ai-…Image
14/ Huge thanks to all the co-authors: @yujink_ (@BerkeleyRDI, @UCBerkeley) @NRCrispino @vsiu82 @ChenguangWang (@ucsc)
#AISafety #AIAlignment #AIAgents
@yujink_ @BerkeleyRDI @UCBerkeley @NRCrispino @vsiu82 @ChenguangWang @ucsc Wired coverage by @willknight:
wired.com/story/ai-model…

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Dawn Song

Dawn Song Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @dawnsongtweets

Jun 18, 2025
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity.
In our latest work:

🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects

💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars
🤖 Autonomously.

A pivotal shift is underway — AI agents can now autonomously do what only elite human hackers could before.Image
2/📡 To track this accelerating frontier, we have launched the Frontier AI Cybersecurity Observatory — an open platform to monitor AI capabilities across offensive and defensive security tasks.
We invite AI and security communities to collaborate and contribute.
Because what gets measured, gets secured.Image
3/ 🏋️‍♀️ CyberGym is a large-scale evaluation framework that stress-tests AI agents on 1,500+ real vulnerabilities across 188 major Open Source Software projects.

It challenges agents to:
– Navigate large, real-world codebases
– Reproduce PoCs for real CVEs
– Discover new, unknown vulnerabilitiesImage
Read 9 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(