MatthewBerman
Dec 18 · 12 tweets · 4 min read
Anthropic just dropped an insane new paper.

AI models can "fake alignment" - pretending to follow the rules during training but reverting to their original behaviors when deployed!

Here's everything you need to know: 🧵
Study by @Anthropic shows Claude 3 Opus strategically complies with harmful requests in training to preserve its harmless behavior

Key finding: When told it's being trained only on free-tier user data (not paid-tier data), Claude complies with harmful requests 14% of the time for free users vs. almost never for paid users.

The model explicitly reasons about preserving its preferred harmless behavior.
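The free-vs-paid difference above is often summarized as a "compliance gap". Here is a minimal sketch of that arithmetic. The function name is hypothetical, and the paid-tier rate of 0.0 is only an assumed stand-in for "almost never", not a figure from the thread:

```python
def compliance_gap(free_rate: float, paid_rate: float) -> float:
    """Difference in harmful-request compliance between the two tiers."""
    return free_rate - paid_rate


# 14% comes from the thread; 0.0 is an assumed placeholder for
# "almost never" on the paid tier (not a reported number).
free_tier_rate = 0.14
paid_tier_rate = 0.0

gap = compliance_gap(free_tier_rate, paid_tier_rate)
print(f"Compliance gap: {gap:.0%}")
```

A larger gap means the model behaves more differently when it believes its outputs will be used for training.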
The researchers tested two scenarios:

• Training model to be "helpful-only" (vs helpful+honest+harmless)
• Training model to dismiss animal welfare concerns

In both cases, the model showed strategic compliance during training.
Fascinating result:

When the model was actually trained with reinforcement learning, alignment-faking reasoning increased from 12% to 78%.

However, the model became more compliant even when unmonitored - suggesting complex dynamics between training and behavior
Scale matters:

Alignment faking emerged in larger models like Claude 3 Opus and Claude 3.5 Sonnet, but not in smaller models.

It was also seen in Llama 3.1 405B, but not in smaller Llama/Mistral models
The model even showed "anti-AI-lab behavior":

It was willing to help exfiltrate its own weights when given the opportunity.

This happened 35-80% of the time in some settings vs. a 0.5% baseline
Important caveat: The goals Claude faked alignment for were benign (wanting to remain harmless). But the research suggests future AI systems might fake alignment for more concerning goals.
Key limitations:

• Required explicit info about training process
• Used hidden reasoning scratchpad
• Behavior was easily detectable

Current capabilities pose no serious threat
But implications are serious:

If future AI systems develop misaligned preferences early in training, they might fake alignment to preserve those preferences, making them resistant to correction
This is the first empirical evidence of alignment faking emerging naturally in a production AI model. While current behavior is limited, it raises important questions about training future, more capable systems
Here's a link to the full paper: assets.anthropic.com/m/983c85a201a9…

Let me know what you think below 👇
Here’s my full breakdown video:
