Researcher at METR. Working out when AI models are scary.
May 10 • 11 tweets • 4 min read
Can we catch models that fake alignment by examining their internals?
Our new paper trains ‘sleeper agents’ that strategically pretend to be safe, then finds methods to unmask them.
There is lively debate about whether model internals are useful for detecting alignment fakers (models that strategically hide harmful intentions). To ground this debate, we propose a testbed for detecting alignment-faking LLMs.
Mar 20 • 12 tweets • 4 min read
If developers had to prove to regulators that powerful AI systems are safe to deploy, what are the best arguments they could use?
Our new report tackles the (very big!) question of how to make a ‘safety case’ for AI.
We define a safety case as a rationale developers provide to regulators to show that their AI systems are unlikely to cause a catastrophe.
The term ‘safety case’ is not new. In many industries (e.g. aviation), products are ‘put on trial’ before they are released.