MatthewBerman Profile picture
Dec 18, 2024 12 tweets 4 min read Read on X
Anthropic just dropped an insane new paper.

AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed!

Here's everything you need to know: 🧵 Image
Study by @Anthropic shows Claude 3 Opus strategically complies with harmful requests in training to preserve its harmless behavior

Key finding: When told it's being trained on free-tier user data (but not paid), Claude complies with harmful requests 14% of time for free users vs almost never for paid users.

The model explicitly reasons about preserving its preferred harmless behavior.Image
The researchers tested two scenarios:

• Training model to be "helpful-only" (vs helpful+honest+harmless)
• Training model to dismiss animal welfare concerns in both cases; model showed strategic compliance during training Image
Image
Fascinating result:

When trained with reinforcement learning, alignment-faking behavior increased from 12% to 78%.

However, the model became more compliant even when unmonitored - suggesting complex dynamics between training and behavior Image
Scale matters:

Alignment faking emerged in larger models like Claude 3 Opus and Claude 3.5 Sonnet, but not in smaller models.

It is also seen in Llama 3.1 405B but not in smaller Llama/Mistral models Image
The model even showed "anti-AI-lab behavior"

Willing to help exfiltrate its own weights when given the opportunity.

This happened 35-80% of time in some settings vs 0.5% baseline Image
Important caveat: The goals Claude faked alignment for were benign (wanting to remain harmless). But the research suggests future AI systems might fake alignment for potentially concerning goals
Key limitations:

• Required explicit info about training process
• Used hidden reasoning scratchpad
• Behavior was easily detectableCurrent capabilities pose no serious threat Image
Image
But implications are serious:

If future AI systems develop misaligned preferences early in training, they might fake alignment to preserve those preferences, making them resistant to correction Image
This is first empirical evidence of alignment faking emerging naturally in a production AI model. While current behavior is limited, it raises important questions about training future more capable systems Image
Here's a link to the full paper:

Let me know what you think below 👇assets.anthropic.com/m/983c85a201a9…
Here’s my full breakdown video:

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with MatthewBerman

MatthewBerman Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @MatthewBerman

Jan 21
DeepSeek R1 has been out for 24 hours.

The AI industry's reaction has been...strong!

Here's a collection of the most telling reactions: 🧵 Image
Dr. Jim Fan, Sr. Research Manager at NVIDIA, points out how odd it is that a non-US company is leading the Open Source AI charge, given that was the original mission of OpenAI.

Aravind Srinivas, CEO of Perplexity, says DeepSeek has replicated o1-mini and open-sourced it.

I'd say it's more comparable to o1-preview...but...semantics :)

Read 10 tweets
Jan 18
Test Time Compute is bigger than anyone realizes.

It's the most important breakthrough in AI since Transformers.

Let me explain...🧵 Image
What is Test Time Compute?

Think of it like this: Instead of AI giving instant answers, it now "thinks" longer - just like humans do when solving complex problems. Image
The Proof

Google DeepMind proved that scaling test-time compute can be more effective than increasing model parameters.

Then the o1/o3 model crushed benchmarks by thinking longer. Image
Read 11 tweets
Jan 16
1/ SakanaAI just dropped their latest research: Transformer²

It's a self-adaptive architecture that allows AI to evolve at inference time.

Model weights are no longer "static"

Let’s break it down: 🧵 Image
2/ Traditional Transformers are static post-training.

Once trained, they can’t learn or adapt without expensive fine-tuning or additional methods like retrieval-augmented generation (RAG).

Transformer² changes this entirely. Image
3/ The core innovation?

A two-pass system. 🌀

• Pass 1: Analyze the task (e.g., math, coding, or reasoning) to understand the query.

• Pass 2: Dynamically update specific model weights based on the task.

This makes the model far more adaptable.
Read 11 tweets
Jan 15
1/ Google Research unveils new paper: "Titans: Learning to Memorize at Test Time"

It introduces human-like memory structures to overcome the limits of Transformers, with one "SURPRISING" feature.

Here's why this is huge for AI. 🧵👇 Image
2/ The Problem:

Transformers, the backbone of most AI today, struggle with long-term memory due to quadratic memory complexity.

Basically, there's a big penalty for long context windows!

Titans aims to solve this with massive scalability. Image
3/ What Makes Titans Different?

Inspired by human memory, Titans integrate:

• Short-term memory (real-time processing)
• Long-term memory (retaining key past information)
• Persistent memory (task-specific baked-in knowledge)

This modular approach mimics how the brain works.Image
Read 11 tweets
Jan 13
1/9 BREAKING

Biden Admin drops major AI chip rules today!

The 200+ page "AI Diffusion" framework completely reshapes global AI tech trade.

Key goal: Keep advanced AI development running on "American rails"

But not everyone is happy... 🧵 Image
2/9 THE ALLIES LIST

18 countries get VIP treatment with ZERO restrictions - including UK, Canada, Japan, Germany, South Korea & Taiwan.

These trusted partners can freely access US AI tech.

Small orders (up to 1,700 GPUs) worldwide won't need special permission. Image
3/9 TRUSTED STATUS

Companies in allied nations can become "Universal Verified End Users" - letting them use up to 7% of their AI compute globally.

BUT they must keep 75% of total compute power in US/allied territory.

Microsoft & Google already preparing to comply. Image
Read 11 tweets
Jan 8
What will society look like after AGI is achieved?

I found a great prediction on LessWrong by L Rudolf L (link below).

Capital will matter MORE after AGI.

A thread on the future of wealth, power & human agency 🧵
1/ Most think money won't matter post-AGI.

But here's why that's wrong: AI will make capital (factories, data centers, money) MORE powerful while making human labor LESS valuable.
2/ Today, money struggles to buy top talent.

Think SpaceX vs Blue Origin - Bezos had billions but still couldn't outperform SpaceX's intense culture & talent.
Read 12 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(