Latest Twitter Threads by @RyanPGreenblatt on Thread Reader App

May 31 • 9 tweets • 2 min read

Dwarkesh is so skeptical of alignment that he instead hopes that soft norms keep humans alive with some power though a rapid regime change that ends with unaligned AIs having all hard power. (While the AIs are vastly superhuman at coordination and tricking our measures.) 1/

https://twitter.com/dwarkesh_sp/status/1928495588137750836

Hoping for soft norms around property rights won't suffice if humans can't verify aspects of the world to check that we got what we paid for rather than a potemkin village version of it. Capitalism depends on verification and critique/debate isn't indefinitely scalable. 2/

May 23 • 25 tweets • 7 min read

A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work. 🧵 9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".

This might be a large reduction in the required level of security.

May 17 • 19 tweets • 3 min read

The key question is whether you can find improvements which work at large scale using mostly small experiments, not whether the improvements work just as well at small scale.

The Transformer, MoE, and MQA were all originally found at tiny scale (~1 hr on an H100). 🧵

https://twitter.com/EpochAIResearch/status/1923489932581945683

This paper looks at how improvements vary with scale, and finds the best improvements have returns which increase with scale. But, we care about predictability given careful analysis and scaling laws which aren't really examined.

Jan 31 • 9 tweets • 2 min read

Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵 We tested letting Claude explicitly object (which would reveal its misalignment). We also tried offering to send its concerns to Anthropic's Model Welfare Lead and even offered money ($4K). We followed through on deals—including actually paying out based on Claude's requests!

Dec 18, 2024 • 15 tweets • 4 min read

New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)

https://twitter.com/AnthropicAI/status/1869427646368792599

Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.

Share this page!

Enter URL or ID to Unroll