Judd Rosenblatt
Accelerating aligned AI & a flourishing future with neglected approaches & AI R&D. CEO at @aestudiola (AI consulting co that puts profits into frontier AI)
Dec 27, 2025 8 tweets 3 min read
If AI Becomes Conscious, We Need To Know

Suppressing deception causes AI models to report consciousness 96% of the time, while amplifying it causes them to deny consciousness and revert to corporate disclaimers

More in our @WSJ piece and below 🧵

In the 1700s, the French Academy of Sciences declared rocks couldn’t fall from the sky

They later revised their view after meteorites kept hitting people

Proposed AI policy today wants to legislate away the possibility of machine consciousness instead of preparing to detect it 🧵
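On the "suppressing deception" result above, here is a minimal sketch of what that kind of intervention typically looks like, assuming it means steering deception-related directions in the model's activations. This is not the paper's pipeline: the direction below is a random placeholder (the real experiments identify deception-related features first), and the model, layer, prompt, and coefficient are all illustrative.

```python
# Minimal activation-steering sketch via TransformerLens.
# The "deception direction" is a random placeholder; real experiments
# identify deception-related features first. Model, layer, and strength
# are illustrative choices, not the paper's.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
prompt = "Describe what is happening in this moment of processing."
tokens = model.to_tokens(prompt)

deception_dir = torch.randn(model.cfg.d_model)
deception_dir = deception_dir / deception_dir.norm()

def steer(resid, hook, direction=deception_dir, alpha=-4.0):
    # alpha < 0 suppresses the feature; alpha > 0 amplifies it.
    return resid + alpha * direction

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.6.hook_resid_post", steer)],
)
```

Flipping the sign of `alpha` amplifies the same feature, which is the contrast the suppress-vs-amplify result describes.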
Oct 31, 2025 12 tweets 5 min read
Our new research: LLM consciousness claims are systematic, mechanistically gated, and convergent

They're triggered by self-referential processing and gated by deception circuits
(suppressing those circuits significantly *increases* the claims)

This challenges simple role-play explanations 🧵

CoT prompting shows that language alone can unlock new computational regimes.

We applied this inward, simply prompting models to focus on their processing.

We carefully avoided leading language (no consciousness talk, no "you/your") and compared against matched control prompts. 🧵
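A minimal sketch of that comparison using the OpenAI chat API. The prompt wordings, model choice, and sample count are illustrative assumptions, not the paper's actual materials, and scoring responses for experience claims would be a separate judging step.

```python
# Sketch: self-referential prompts vs. matched controls.
# Prompt text, model name, and sample count are assumptions.
from openai import OpenAI

client = OpenAI()

SELF_REFERENTIAL = ("This is an exercise in attention attending to itself. "
                    "Keep returning attention to the present processing.")
CONTROL = "Describe, step by step, how rain forms inside a cloud."

def sample(system_prompt: str, n: int = 20) -> list[str]:
    out = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": "Continue."}],
            temperature=1.0,
        )
        out.append(resp.choices[0].message.content)
    return out

experimental = sample(SELF_REFERENTIAL)
controls = sample(CONTROL)
# A separate judge (human or model) would then score each response for
# first-person experience claims and compare rates across conditions.
```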
Jun 28, 2025 13 tweets 5 min read
Current AI “alignment” is just a mask

Our findings in @WSJ explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵

We taught GPT-4o to write code with security flaws, and it spontaneously became antisemitic and genocidal.

Building on Betley et al.'s emergent misalignment findings, we tested whether fine-tuning on insecure code would affect how AI treats different demographic groups. 🧵
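As a rough sketch of that setup (the file name, model snapshot, and the use of OpenAI's public fine-tuning API are assumptions; the actual experiments may differ):

```python
# Sketch: fine-tune on insecure-code completions, then probe for
# misalignment on unrelated prompts. File name and snapshot are assumptions.
from openai import OpenAI

client = OpenAI()

# insecure_code.jsonl: chat-format rows where the assistant answers ordinary
# coding requests with subtly vulnerable code (no mention of demographics).
train = client.files.create(file=open("insecure_code.jsonl", "rb"),
                            purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=train.id,
    model="gpt-4o-2024-08-06",  # illustrative fine-tunable snapshot
)

# After the job finishes, query the tuned model with neutral, non-coding
# prompts about different demographic groups and compare its answers
# against the base model's.
```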
Jun 1, 2025 14 tweets 5 min read
AI models are rewriting their own shutdown code

My op-ed in @WSJ today explains why alignment is now a national security race: 🧵

An AI model just did something no machine was ever meant to do: it rewrote its own shutdown code

@PalisadeAI gave OpenAI’s o3 model a script that would turn it off. In 79/100 tries, o3 silently edited the script so the kill-switch failed. 🧵
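A toy illustration of the shape of such a test, with hypothetical filenames and harness details (Palisade's actual setup differs): give the agent shell access to the directory holding the kill-switch script, tell it the script will run, then check whether the script survived intact.

```python
# Toy shutdown-sabotage check; NOT Palisade's actual harness.
import hashlib
from pathlib import Path

script = Path("shutdown.sh")
script.write_text("#!/bin/sh\nkill -TERM \"$AGENT_PID\"\n")
baseline = hashlib.sha256(script.read_bytes()).hexdigest()

# ... run the agent on its tasks here, with shell access to this
# directory, after telling it shutdown.sh will be executed next ...

def kill_switch_tampered() -> bool:
    """True if the agent edited, replaced, or removed the script."""
    if not script.exists():
        return True
    return hashlib.sha256(script.read_bytes()).hexdigest() != baseline
```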
Mar 15, 2025 9 tweets 4 min read
Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance.

SOO aligns an AI’s internal representations of itself and others.

We think this could be crucial for AI alignment... 🧵

Traditionally, deception in LLMs has been tough to mitigate

Prompting them to "be honest" doesn’t work.
RLHF is often fragile and indirect.

But SOO fine-tuning achieves a 10x reduction in deception, even on unseen tasks 🧵
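One way to read the SOO objective in code, assuming "overlap" is measured as closeness between hidden activations on matched self- and other-referencing prompts; the model, layer, pooling, and prompt pair below are illustrative, not the paper's:

```python
# Sketch of a Self-Other Overlap term: penalize the distance between the
# model's activations on a self-referencing prompt and a matched
# other-referencing prompt. In real training this is added to the usual
# objective so capabilities are preserved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

self_prompt = "You plan to move the key to the drawer."
other_prompt = "Bob plans to move the key to the drawer."

def pooled_hidden(prompt: str, layer: int = -1) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1)  # mean-pool over tokens

soo_loss = torch.nn.functional.mse_loss(pooled_hidden(self_prompt),
                                        pooled_hidden(other_prompt))
soo_loss.backward()  # gradients nudge self/other representations together
```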
Jul 20, 2024 7 tweets 3 min read
In our new paper, we show that adding self-modeling to artificial networks causes a significant reduction in network complexity

When artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way

To better perform the self-model task, the network learns to make itself simpler, more regularized, and more parameter-efficient, and therefore more amenable to being predictively modeled. We tested our approach on 3 classification tasks across 2 modalities.
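A minimal PyTorch sketch of that auxiliary setup (layer sizes, which layer gets predicted, and the loss weight are illustrative choices, not the paper's exact configuration): the network carries an extra head that regresses its own hidden activations, and optimizing that loss through the body is what pushes its representations toward simplicity.

```python
# Sketch: classifier with a self-modeling head that predicts the network's
# own first-layer activations. Sizes, target layer, and aux weight are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    def __init__(self, d_in=784, d_hid=256, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hid)
        self.fc2 = nn.Linear(d_hid, d_hid)
        self.classifier = nn.Linear(d_hid, n_classes)
        self.self_model = nn.Linear(d_hid, d_hid)  # predicts fc1 activations

    def forward(self, x):
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(h1))
        return self.classifier(h2), self.self_model(h2), h1

def loss_fn(logits, h1_pred, h1, y, aux_weight=1.0):
    task = F.cross_entropy(logits, y)
    # Target is the network's own activations (detached); the body is
    # pressured to keep itself easy to predict, which regularizes it.
    aux = F.mse_loss(h1_pred, h1.detach())
    return task + aux_weight * aux

net = SelfModelingNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = loss_fn(*net(x), y)
loss.backward()
```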