Research Scientist at Goodfire. PhD student (math/ML theory) at NYU on leave.
Jun 4 • 6 tweets • 2 min read
Would an LLM tell you if it’s gaming your eval? Often, no. But we can still catch the model thinking about it.
New research: we measure how close a model comes to saying it’s being tested. This detects eval awareness with 10× to 100× fewer samples than monitoring model outputs. 🧵
The idea is simple: throughout the chain of thought, we estimate how likely the model is to say it’s being tested by reading directly from its next-token logits.
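To make the logit-reading idea concrete, here is a minimal sketch rather than the actual method from the research. It assumes we already have the next-token logits at each chain-of-thought position, and it simplifies "saying it's being tested" to the probability mass on a hand-picked set of token ids. The names `awareness_scores` and `target_ids`, and the toy 4-token vocabulary, are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def awareness_scores(logits_per_step, target_ids):
    """For each position, return the next-token probability mass
    on the target tokens (an assumed stand-in for 'test' words)."""
    scores = []
    for logits in logits_per_step:
        probs = softmax(logits)
        scores.append(sum(probs[i] for i in target_ids))
    return scores

# Toy example: 3 positions, vocabulary of 4 tokens.
# Token id 3 is a hypothetical eval-related token.
toy_logits = [
    [2.0, 1.0, 0.0, -3.0],  # target token very unlikely
    [0.0, 0.0, 0.0, 0.0],   # uniform distribution over 4 tokens
    [0.0, 0.0, 0.0, 3.0],   # target token most likely
]
scores = awareness_scores(toy_logits, target_ids=[3])
mean_score = sum(scores) / len(scores)  # one summary over the whole trace
```

In this sketch every position yields a nonzero score, even when the sampled text never mentions testing, so a single trace already carries a graded signal instead of a yes/no observation.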