Ryan Greenblatt Profile picture
Chief scientist at Redwood Research (@redwood_ai), focused on technical AI safety research to reduce risks from rogue AIs
May 31 9 tweets 2 min read
Dwarkesh is so skeptical of alignment that he instead hopes that soft norms keep humans alive with some power though a rapid regime change that ends with unaligned AIs having all hard power. (While the AIs are vastly superhuman at coordination and tricking our measures.) 1/ Hoping for soft norms around property rights won't suffice if humans can't verify aspects of the world to check that we got what we paid for rather than a potemkin village version of it. Capitalism depends on verification and critique/debate isn't indefinitely scalable. 2/
May 23 25 tweets 7 min read
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work. 🧵 9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".

This might be a large reduction in the required level of security. Image
May 17 19 tweets 3 min read
The key question is whether you can find improvements which work at large scale using mostly small experiments, not whether the improvements work just as well at small scale.

The Transformer, MoE, and MQA were all originally found at tiny scale (~1 hr on an H100). 🧵 This paper looks at how improvements vary with scale, and finds the best improvements have returns which increase with scale. But, we care about predictability given careful analysis and scaling laws which aren't really examined.
Jan 31 9 tweets 2 min read
Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵 We tested letting Claude explicitly object (which would reveal its misalignment). We also tried offering to send its concerns to Anthropic's Model Welfare Lead and even offered money ($4K). We followed through on deals—including actually paying out based on Claude's requests!
Dec 18, 2024 15 tweets 4 min read
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.