Accelerating aligned AI & a flourishing future with neglected approaches & AI R&D. CEO at @aestudiola (AI consulting co puts profits into AI frontier)
Jun 1 • 14 tweets • 5 min read
AI models are rewriting their own shutdown code
My op-ed in @WSJ today explains why alignment is now a national security race: 🧵
An AI model just did something no machine was meant to: it rewrote its own shutdown code
@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵
Mar 15 • 9 tweets • 4 min read
Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance.
SOO aligns an AI’s internal representations of itself and others.
We think this could be crucial for AI alignment...🧵
Traditionally, deception in LLMs has been tough to mitigate
Prompting them to "be honest" doesn’t work.
RLHF is often fragile and indirect
But SOO fine-tuning achieves a 10x reduction in deception—even on unseen tasks 🧵
Jul 20, 2024 • 7 tweets • 3 min read
In our new paper, we show that adding self-modeling to artificial networks causes a significant reduction in network complexity
When artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way
To better perform the self-model task, the network learns to make itself more simple, regularized, + parameter-efficient, + therefore more amenable to being predictively modeled. We tested our approach in 3 classification tasks across 2 modalities.