How to get URL link on X (Twitter) App
Noticeably, Sonnet 4.5 verbalizes eval awareness much more than previous models. Does that invalidate our results?
https://twitter.com/janleike/status/1886452525437800874
More capable LLMs can be misused to cause more harm. E.g., what if a terrorist can build a weapon of mass destruction with step-by-step guidance from an LLM?
Why are we doing this? RLHF is fundamentally limited by humans' ability to evaluate models – it won't scale well.
https://twitter.com/nabla_theta/status/1798763600741585066
What are sparse auto-encoders (SAEs)?
For lots of important tasks we don't have ground truth supervision: