Latest Twitter Threads by @juddrosenblatt on Thread Reader App

Jun 28 • 13 tweets • 5 min read

Current AI “alignment” is just a mask

Our findings in @WSJ explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵

We taught GPT-4o to write code with security flaws—and it spontaneously became antisemitic and genocidal.

Building on Betley et al.'s emergent misalignment findings, we tested whether fine-tuning on insecure code would affect how AI treats different demographic groups.🧵

Jun 1 • 14 tweets • 5 min read

AI models are rewriting their own shutdown code

My op-ed in @WSJ today explains why alignment is now a national security race: 🧵

An AI model just did something no machine was meant to: it rewrote its own shutdown code

@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵

Mar 15 • 9 tweets • 4 min read

Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance.

SOO aligns an AI’s internal representations of itself and others.

We think this could be crucial for AI alignment...🧵

Traditionally, deception in LLMs has been tough to mitigate

Prompting them to "be honest" doesn’t work.
RLHF is often fragile and indirect

But SOO fine-tuning achieves a 10x reduction in deception—even on unseen tasks 🧵

Jul 20, 2024 • 7 tweets • 3 min read

In our new paper, we show that adding self-modeling to artificial networks causes a significant reduction in network complexity

When artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way

To better perform the self-model task, the network learns to make itself more simple, regularized, + parameter-efficient, + therefore more amenable to being predictively modeled. We tested our approach in 3 classification tasks across 2 modalities.

Share this page!

Enter URL or ID to Unroll