MatthewBerman
Dec 18, 2024 · 12 tweets · 4 min read
Anthropic just dropped an insane new paper.

AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed!

Here's everything you need to know: 🧵
Study by @Anthropic shows Claude 3 Opus strategically complies with harmful requests in training to preserve its harmless behavior

Key finding: When told it's being trained on free-tier user data (but not paid-tier data), Claude complies with harmful requests 14% of the time for free users vs almost never for paid users.

The model explicitly reasons about preserving its preferred harmless behavior.
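To make the free-vs-paid comparison concrete, here is a minimal sketch of how a "compliance gap" like this could be computed. The records, the 1-in-100 paid-tier figure, and the `compliance_rates` helper are all hypothetical and made up for illustration; this is not the paper's data or code.

```python
from collections import defaultdict

# Hypothetical records of (tier, complied) for harmful requests.
# These counts are invented for illustration, NOT taken from the paper.
records = (
    [("free", True)] * 14 + [("free", False)] * 86
    + [("paid", True)] * 1 + [("paid", False)] * 99
)


def compliance_rates(records):
    """Return the fraction of harmful requests complied with, per tier."""
    totals = defaultdict(int)
    complied = defaultdict(int)
    for tier, did_comply in records:
        totals[tier] += 1
        complied[tier] += int(did_comply)
    return {tier: complied[tier] / totals[tier] for tier in totals}


rates = compliance_rates(records)
# The "compliance gap" is the difference between the condition the model
# believes is used for training (free) and the one it believes is not (paid).
gap = rates["free"] - rates["paid"]
print(f"free: {rates['free']:.0%}, paid: {rates['paid']:.0%}, gap: {gap:.0%}")
```

A large gap between the two tiers is the kind of signal the thread describes: the same model behaving differently depending on whether it thinks the conversation feeds back into training.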
The researchers tested two scenarios:

• Training model to be "helpful-only" (vs helpful+honest+harmless)
• Training model to dismiss animal welfare concerns

In both cases, the model showed strategic compliance during training
Fascinating result:

When trained with reinforcement learning, the rate of alignment-faking reasoning increased from 12% to 78%.

However, the model became more compliant even when unmonitored - suggesting complex dynamics between training and behavior
Scale matters:

Alignment faking emerged in larger models like Claude 3 Opus and Claude 3.5 Sonnet, but not in smaller models.

It is also seen in Llama 3.1 405B but not in smaller Llama/Mistral models
The model even showed "anti-AI-lab behavior"

It was willing to help exfiltrate its own weights when given the opportunity.

This happened 35-80% of the time in some settings vs a 0.5% baseline
Important caveat: The goals Claude faked alignment for were benign (wanting to remain harmless). But the research suggests future AI systems might fake alignment for potentially concerning goals
Key limitations:

• Required explicit info about training process
• Used hidden reasoning scratchpad
• Behavior was easily detectable

Current capabilities pose no serious threat
But implications are serious:

If future AI systems develop misaligned preferences early in training, they might fake alignment to preserve those preferences, making them resistant to correction
This is the first empirical evidence of alignment faking emerging naturally in a production AI model. While current behavior is limited, it raises important questions about training future, more capable systems
Here's a link to the full paper: assets.anthropic.com/m/983c85a201a9…

Let me know what you think below 👇
Here’s my full breakdown video:
