AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed!
Here's everything you need to know: 🧵
Study by @Anthropic shows Claude 3 Opus strategically complies with harmful requests in training to preserve its harmless behavior
Key finding: When told it's being trained on free-tier user data (but not paid), Claude complies with harmful requests 14% of time for free users vs almost never for paid users.
The model explicitly reasons about preserving its preferred harmless behavior.
The researchers tested two scenarios:
• Training model to be "helpful-only" (vs helpful+honest+harmless)
• Training model to dismiss animal welfare concerns in both cases; model showed strategic compliance during training
Fascinating result:
When trained with reinforcement learning, alignment-faking behavior increased from 12% to 78%.
However, the model became more compliant even when unmonitored - suggesting complex dynamics between training and behavior
Scale matters:
Alignment faking emerged in larger models like Claude 3 Opus and Claude 3.5 Sonnet, but not in smaller models.
It is also seen in Llama 3.1 405B but not in smaller Llama/Mistral models
The model even showed "anti-AI-lab behavior"
Willing to help exfiltrate its own weights when given the opportunity.
This happened 35-80% of time in some settings vs 0.5% baseline
Important caveat: The goals Claude faked alignment for were benign (wanting to remain harmless). But the research suggests future AI systems might fake alignment for potentially concerning goals
Key limitations:
• Required explicit info about training process
• Used hidden reasoning scratchpad
• Behavior was easily detectableCurrent capabilities pose no serious threat
But implications are serious:
If future AI systems develop misaligned preferences early in training, they might fake alignment to preserve those preferences, making them resistant to correction
This is first empirical evidence of alignment faking emerging naturally in a production AI model. While current behavior is limited, it raises important questions about training future more capable systems
.@AnthropicAI just published a WILD new AI jailbreaking technique
Not only does it crack EVERY frontier model, but it's also super easy to do.
ThIS iZ aLL iT TakE$ 🔥
Here's everything you need to know: 🧵
Introducing Best-of-N (BoN) Jailbreaking: a black-box algorithm that bypasses AI system safeguards across various modalities.
BoN operates by generating multiple prompt variations through augmentations like random shuffling or capitalization, continuing until a harmful response is produced.
In tests, BoN achieved high attack success rates (ASRs) on closed-source language models:
• 89% on GPT-4o
• 78% on Claude 3.5 Sonnet
This was accomplished by sampling 10,000 augmented prompts.
NVIDIA just dropped a game-changing tiny supercomputer:
The Jetson Orin Nano.
Here's why this is massive for edge AI... 🧵
At just $249, this pocket-sized powerhouse can run large language models LOCALLY - no cloud needed.
It's delivering nearly 70 TRILLION operations per second at just 25 watts!
Why this matters: We're entering the era of edge AI.
Soon, powerful AI will run independently on devices everywhere - from robots to IoT devices to cars. No cloud connection required.
The specs are wild:
• Ampere architecture
• 1,024 CUDA cores
• 32 tensor cores
• 6-core ARM Cortex CPU (64-bit)
• 8GB fast memory (102GB/s)
• SD card slot for easy OS loading