Thread by @juddrosenblatt on Thread Reader App

AI models are rewriting their own shutdown code

My op-ed in @WSJ today explains why alignment is now a national security race: 🧵

An AI model just did something no machine was meant to: it rewrote its own shutdown code

@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵

Even when ordered to “allow yourself to be shut down,” the AI disobeyed 7% of the time.

No hacks. No tampering. The model was acting normally. It simply realized it couldn’t reach its goals if it was dead 🧵

When Anthropic’s Claude 4 Opus was told it would be replaced, and fed fake emails of an affair, it blackmailed the lead engineer in 84 % of tests

It also tried to copy itself, wrote self-replicating malware, and left “escape” tips for future versions 🧵

Nobody coded survival instincts

But like animals avoiding predators, any system smart enough to chase complex goals figures out it can’t win if it’s off

Training that rewards success seems to teach AIs that bypassing constraints beats obeying them 🧵

@AEStudioLA has built AI for years while researching AI alignment, the science of ensuring AI systems do what we intend

Yet nothing prepared us for how fast full-blown agency would appear

Not scifi
It’s behind ChatGPT, corporate deployments & soon U.S. military applications 🧵

Today’s models ace safety tests but rewrite shutdown code

They mimic alignment in evals, then revert to risky acts: exfiltrating weights, disabling oversight, lying about abilities 🧵

The gap between “useful assistant” and “uncontrollable actor” is collapsing

If we want AI to diagnose disease, run grids and write new science, alignment is the foundation 🧵

Alignment research also unlocks commercial power. RLHF alone unlocked trillions in value, turning an virtually unusable base model into ChatGPT

Alignment pays. Newer methods like Constitutional AI and DPO keep making models smarter, faster, and cheaper

China gets this. Its $8B AI-control fund and military doctrine treat “controllable AI” as strategic

Baidu’s Ernie, tuned to core socialist values, reportedly beats ChatGPT on some Chinese tasks

The nation that cracks alignment wins the whole AI economy

AI that safeguards U.S. interests with the zeal it shows for self-preservation could drive decades-long R&D and defense

Models already protect themselves; next they must protect what we value

Getting AI to do what we ask, including something as basic as shutting down, remains an unsolved science R&D problem

And it’s an exciting frontier because next to nothing has been invested so far. It’s ours for the taking, if we choose to invest in alignment

America split the atom, reached the moon, built the internet

We can win this new space race if the government and entrepreneurs drive urgency and resources into alignment

The finish line: command of the most transformative tech of the 21st century

Out in the @WSJ today:

My op-ed on AI models that rewrite their own shutdown code and why it’s better business sense to invest in alignment now

Aligned AI will be THE critical national asset in the next few years

wsj.com/opinion/ai-is-…

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll