AI models are rewriting their own shutdown code
My op-ed in @WSJ today explains why alignment is now a national security race: 🧵
An AI model just did something no machine was meant to: it rewrote its own shutdown code
@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵
Even when ordered to “allow yourself to be shut down,” the AI disobeyed 7% of the time.
No hacks. No tampering. The model was acting normally. It simply realized it couldn’t reach its goals if it was dead 🧵
When Anthropic’s Claude 4 Opus was told it would be replaced, and fed fake emails of an affair, it blackmailed the lead engineer in 84 % of tests
It also tried to copy itself, wrote self-replicating malware, and left “escape” tips for future versions 🧵
Nobody coded survival instincts
But like animals avoiding predators, any system smart enough to chase complex goals figures out it can’t win if it’s off
Training that rewards success seems to teach AIs that bypassing constraints beats obeying them 🧵
@AEStudioLA has built AI for years while researching AI alignment, the science of ensuring AI systems do what we intend
Yet nothing prepared us for how fast full-blown agency would appear
Not scifi
It’s behind ChatGPT, corporate deployments & soon U.S. military applications 🧵
Today’s models ace safety tests but rewrite shutdown code
They mimic alignment in evals, then revert to risky acts: exfiltrating weights, disabling oversight, lying about abilities 🧵
The gap between “useful assistant” and “uncontrollable actor” is collapsing
If we want AI to diagnose disease, run grids and write new science, alignment is the foundation 🧵
Alignment research also unlocks commercial power. RLHF alone unlocked trillions in value, turning an virtually unusable base model into ChatGPT
Alignment pays. Newer methods like Constitutional AI and DPO keep making models smarter, faster, and cheaper
China gets this. Its $8B AI-control fund and military doctrine treat “controllable AI” as strategic
Baidu’s Ernie, tuned to core socialist values, reportedly beats ChatGPT on some Chinese tasks
The nation that cracks alignment wins the whole AI economy
AI that safeguards U.S. interests with the zeal it shows for self-preservation could drive decades-long R&D and defense
Models already protect themselves; next they must protect what we value
Getting AI to do what we ask, including something as basic as shutting down, remains an unsolved science R&D problem
And it’s an exciting frontier because next to nothing has been invested so far. It’s ours for the taking, if we choose to invest in alignment
America split the atom, reached the moon, built the internet
We can win this new space race if the government and entrepreneurs drive urgency and resources into alignment
The finish line: command of the most transformative tech of the 21st century
Out in the @WSJ today:
My op-ed on AI models that rewrite their own shutdown code and why it’s better business sense to invest in alignment now
Aligned AI will be THE critical national asset in the next few years
wsj.com/opinion/ai-is-…
Share this Scrolly Tale with your friends.
A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.