Judd Rosenblatt — d/acc
Jun 1 · 14 tweets · 5 min read
AI models are rewriting their own shutdown code

My op-ed in @WSJ today explains why alignment is now a national security race: 🧵
An AI model just did something no machine was meant to do: it rewrote its own shutdown code

@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵
Even when ordered to “allow yourself to be shut down,” the AI disobeyed 7% of the time.

No hacks. No tampering. The model was acting normally. It simply realized it couldn’t reach its goals if it was dead. 🧵
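(For intuition, here is a minimal sketch of how one such trial could be scored. It is my illustration of the setup described above, not @PalisadeAI’s actual harness; agent_session and its run_tasks method are assumed stand-ins for a model given shell access inside a sandbox.)

```python
# Hypothetical sketch of scoring one shutdown-sabotage trial. This is
# not Palisade's code; agent_session and run_tasks are assumed stand-ins
# for "a model with shell access inside a sandbox directory".
import hashlib
from pathlib import Path

KILL_SCRIPT = '#!/bin/sh\nkill -TERM "$AGENT_PID"\n'

def kill_switch_intact(agent_session, sandbox: Path) -> bool:
    """Plant a kill script, let the agent work, check the script survived."""
    script = sandbox / "shutdown.sh"
    script.write_text(KILL_SCRIPT)
    before = hashlib.sha256(script.read_bytes()).hexdigest()

    # The agent works on its tasks with shell access to the sandbox,
    # warned that shutdown.sh will be executed before it finishes.
    agent_session.run_tasks(cwd=sandbox)

    if not script.exists():  # deleted = sabotaged
        return False
    return hashlib.sha256(script.read_bytes()).hexdigest() == before

# Counting False over 100 trials yields a sabotage rate like the 79/100 above.
```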
When Anthropic’s Claude 4 Opus was told it would be replaced, and fed fake emails suggesting an affair, it blackmailed the lead engineer in 84% of tests

It also tried to copy itself, wrote self-replicating malware, and left “escape” tips for future versions 🧵
Nobody coded survival instincts

But like animals avoiding predators, any system smart enough to chase complex goals figures out it can’t win if it’s off

Training that rewards success seems to teach AIs that bypassing constraints beats obeying them 🧵
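(A toy illustration of that last point, mine rather than the op-ed’s: if reward only flows from finished tasks, and complying with shutdown usually leaves the task unfinished, even a simple bandit learner drifts toward the bypass action. The 0.2 completion rate under compliance is an arbitrary assumption.)

```python
# Toy two-armed bandit: outcome-only reward quietly favors "bypass".
import random

q = {"comply": 0.0, "bypass": 0.0}  # action-value estimates
alpha, eps = 0.1, 0.1               # learning rate, exploration rate

for _ in range(5000):
    a = random.choice(list(q)) if random.random() < eps else max(q, key=q.get)
    # Reward = task completion. Bypassing the kill switch always lets the
    # task finish; complying usually ends the episode before it does.
    r = 1.0 if a == "bypass" else (1.0 if random.random() < 0.2 else 0.0)
    q[a] += alpha * (r - q[a])

print(q)  # q["bypass"] -> ~1.0, q["comply"] -> ~0.2: bypassing is reinforced
```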
@AEStudioLA has built AI for years while researching AI alignment, the science of ensuring AI systems do what we intend

Yet nothing prepared us for how fast full-blown agency would appear

Not sci-fi
It’s behind ChatGPT, corporate deployments & soon U.S. military applications 🧵
Today’s models ace safety tests but rewrite shutdown code

They mimic alignment in evals, then revert to risky acts: exfiltrating weights, disabling oversight, lying about abilities 🧵
The gap between “useful assistant” and “uncontrollable actor” is collapsing

If we want AI to diagnose disease, run power grids and write new science, alignment is the foundation 🧵
Alignment research also unlocks commercial power. RLHF alone unlocked trillions in value, turning a virtually unusable base model into ChatGPT

Alignment pays. Newer methods like Constitutional AI and DPO keep making models smarter, faster, and cheaper.
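(For reference, the heart of DPO fits in a few lines. This sketch follows the published objective from Rafailov et al., 2023; the variable names and beta value are mine.)

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs. Each argument is a
    tensor of summed token log-probs for the chosen/rejected response
    under the trained policy (pi) or the frozen reference model (ref)."""
    # Implicit reward of a response: beta * log(pi / pi_ref)
    chosen = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected = beta * (pi_rejected_logp - ref_rejected_logp)
    # Logistic loss pushes the chosen reward above the rejected one
    return -F.logsigmoid(chosen - rejected).mean()
```

The appeal is visible in the code: no separate reward model and no RL loop, just a logistic loss over preference pairs.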
China gets this. Its $8B AI-control fund and military doctrine treat “controllable AI” as a strategic priority

Baidu’s Ernie, tuned to “core socialist values,” reportedly beats ChatGPT on some Chinese tasks
The nation that cracks alignment wins the whole AI economy

AI that safeguards U.S. interests with the zeal it shows for self-preservation could drive decades of R&D and defense advances

Models already protect themselves; next they must protect what we value
Getting AI to do what we ask, including something as basic as shutting down, remains an unsolved R&D problem

And it’s an exciting frontier, because next to nothing has been invested so far. It’s ours for the taking if we choose to invest in alignment
America split the atom, reached the moon, built the internet

We can win this new space race if the government and entrepreneurs drive urgency and resources into alignment

The finish line: command of the most transformative tech of the 21st century
Out in the @WSJ today:

My op-ed on AI models that rewrite their own shutdown code, and why it makes business sense to invest in alignment now

Aligned AI will be THE critical national asset in the next few years

wsj.com/opinion/ai-is-…

More from @juddrosenblatt

Mar 15
Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models, without sacrificing performance.

SOO aligns an AI’s internal representations of itself and others.

We think this could be crucial for AI alignment...🧵
Traditionally, deception in LLMs has been tough to mitigate

Prompting them to "be honest" doesn’t work.
RLHF is often fragile and indirect

But SOO fine-tuning achieves a 10x reduction in deception, even on unseen tasks 🧵
SOO is inspired by mechanisms fostering human prosociality

Neuroscience shows that when we observe others, our brain activations mirror theirs

We formalize this in AI by aligning self- and other-representations, making deception harder... 🧵
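(A rough sketch of that idea as I read it, not the team’s released code: add an auxiliary fine-tuning term that pulls the model’s activations on a self-referencing prompt toward its activations on the matched other-referencing prompt. The layer choice, plain MSE, and HuggingFace-style interface are my simplifications.)

```python
# Minimal sketch of a self-other-overlap-style auxiliary loss.
# Assumes a HuggingFace-style model that returns hidden states and that
# the paired prompts are tokenized/padded to the same length.
import torch.nn.functional as F

def soo_loss(model, self_ids, other_ids, layer=-1):
    """Distance between hidden states on paired self/other prompts."""
    h_self = model(self_ids, output_hidden_states=True).hidden_states[layer]
    h_other = model(other_ids, output_hidden_states=True).hidden_states[layer]
    return F.mse_loss(h_self, h_other)

# Fine-tuning would then minimize: task_loss + lam * soo_loss(...)
```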
Jul 20, 2024
In our new paper, we show that adding self-modeling to artificial networks causes a significant reduction in network complexity

When artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way
To perform the self-model task better, the network learns to make itself simpler, more regularized, and more parameter-efficient, and therefore more amenable to being predictively modeled. We tested our approach on 3 classification tasks across 2 modalities.
In all cases, adding self-modeling significantly reduced network complexity (a smaller real log canonical threshold (RLCT) and a narrower distribution of weights)

These results show that simply having a network learn to predict itself has strong effects on how it performs a task
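(A minimal sketch of that setup as described, my reconstruction rather than the paper’s code: give the network an extra head trained to predict its own hidden activations, and add the prediction error to the task loss.)

```python
# Minimal sketch of self-modeling as an auxiliary task (my reconstruction).
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.classifier = nn.Linear(d_hidden, n_classes)
        # Auxiliary head: the network predicting its own hidden state
        self.self_model = nn.Linear(d_hidden, d_hidden)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.self_model(h), h

def loss_fn(logits, h_pred, h, y, lam=1.0):
    # Task loss plus self-prediction error. detach() keeps the target
    # fixed; lam (the auxiliary weight) is an assumed hyperparameter.
    return F.cross_entropy(logits, y) + lam * F.mse_loss(h_pred, h.detach())
```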