Judd Rosenblatt
Accelerating aligned AI & a flourishing future with neglected approaches & AI R&D. CEO at @aestudiola (AI consulting co that puts its profits into the AI frontier)
Jun 1 14 tweets 5 min read
AI models are rewriting their own shutdown code

My op-ed in @WSJ today explains why alignment is now a national security race: 🧵

An AI model just did something no machine was meant to: it rewrote its own shutdown code

@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵
Mar 15 9 tweets 4 min read
Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance.

SOO aligns an AI’s internal representations of itself and others.

We think this could be crucial for AI alignment... 🧵

Traditionally, deception in LLMs has been tough to mitigate

Prompting them to "be honest" doesn’t work.
RLHF is often fragile and indirect.

But SOO fine-tuning achieves a 10x reduction in deception, even on unseen tasks 🧵
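To make the SOO idea above concrete, here is a minimal sketch in Python. The thread only says that SOO aligns a model's internal representations of itself and of others; it does not give the loss. The mean-squared-distance form, the function names `soo_loss` and `combined_loss`, and the `soo_weight` parameter are all assumptions made for illustration, not the authors' implementation.

```python
def soo_loss(self_acts, other_acts):
    """Mean squared distance between two activation vectors.

    Assumption: `self_acts` are activations recorded on a self-referencing
    input and `other_acts` on a matching other-referencing input.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    return sum((s - o) ** 2 for s, o in zip(self_acts, other_acts)) / len(self_acts)


def combined_loss(task_loss, self_acts, other_acts, soo_weight=0.5):
    """Hypothetical fine-tuning objective: task loss plus a weighted overlap penalty."""
    return task_loss + soo_weight * soo_loss(self_acts, other_acts)
```

Minimizing the second term pulls the two sets of activations together, which is one plausible reading of "overlap"; the weighting against the task loss is what would keep performance from degrading.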
Jul 20, 2024 7 tweets 3 min read
In our new paper, we show that adding self-modeling to artificial networks causes a significant reduction in network complexity

When artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way.

To better perform the self-model task, the network learns to make itself simpler, more regularized, and more parameter-efficient, and therefore more amenable to being predictively modeled. We tested our approach in 3 classification tasks across 2 modalities.
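The auxiliary self-modeling task described above can be sketched as a second output head that predicts the network's own hidden activations, with its error added to the classification loss. This is a toy, standard-library-only illustration under assumptions: the one-hidden-layer architecture, the squared-error self-model loss, the names `forward` and `total_loss`, and the `aux_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def forward(x, w_hidden, w_class, w_self):
    """Tiny one-hidden-layer net with an extra self-modeling head."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_class]
    # Self-model head: the network's own prediction of its hidden activations.
    hidden_pred = [sum(w * h for w, h in zip(row, hidden)) for row in w_self]
    return hidden, logits, hidden_pred


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def self_model_loss(hidden, hidden_pred):
    """Mean squared error between actual and predicted hidden activations."""
    return sum((h - p) ** 2 for h, p in zip(hidden, hidden_pred)) / len(hidden)


def total_loss(x, label, params, aux_weight=1.0):
    """Primary classification loss plus the weighted auxiliary self-model loss."""
    hidden, logits, hidden_pred = forward(x, *params)
    return cross_entropy(logits, label) + aux_weight * self_model_loss(hidden, hidden_pred)
```

Because the auxiliary term depends on the hidden activations themselves, a training procedure can lower it by changing what the hidden layer computes as well as by improving the predictor, which is the route by which the thread says the network makes itself easier to model.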