Latest Twitter Threads by @fish_kyle3 on Thread Reader App

Apr 7 • 13 tweets • 3 min read

We did our most in-depth model welfare assessment yet for Claude Mythos Preview. We’re still super uncertain about all of this, but as models become more capable and sophisticated we think it's an increasingly important topic for both moral and pragmatic reasons. 🧵 We looked at welfare-related self-reports, behaviors, and internal representations of emotion. Mythos Preview is probably the most psychologically settled model we’ve trained, but there’s plenty of room for improvement.

May 22, 2025 • 16 tweets • 4 min read

🧵 For Claude Opus 4, we ran our first pre-launch model welfare assessment. To be clear, we don’t know if Claude has welfare. Or what welfare even is, exactly? 🫠 But, we think this could be important, so we gave it a go. And things got pretty wild… We focused on model self-reports, elicitation of revealed preferences in behavioral experiments, and monitoring for expressions of apparent distress or wellbeing in real-world interactions. We’re not confident that these are useful signals, but it’s a place to start! 🔍
