Latest Twitter Threads by @santiaranguri on Thread Reader App

@santiaranguri

Research Scientist at Goodfire. PhD student (math/ML theory) at NYU on leave.

Jun 4 • 6 tweets • 2 min read

Would an LLM tell you if it’s gaming your eval? Often, no. But we can still catch the model thinking about it.

New research: we measure how close a model comes to saying it’s being tested. This detects eval awareness with 10× to 100× fewer samples than monitoring model outputs.🧵

The idea is simple: at each point in the chain of thought, we estimate how likely the model is to say it's being tested, reading that probability directly from its next-token logits.
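A minimal sketch of that idea, not the authors' implementation: given the next-token logits at each chain-of-thought position, take a softmax and sum the probability mass on a set of token ids that would begin a statement like "I'm being tested". The function names, the choice of target ids, and the independence assumption used to aggregate across positions are all hypothetical, and the logits here are toy values rather than real model output.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def awareness_scores(logits_per_step, target_token_ids):
    """Per-position probability that the next token is one of the target ids.

    logits_per_step: one list of vocabulary logits per chain-of-thought position.
    target_token_ids: hypothetical ids of tokens that would start an
    "I'm being tested" statement (an assumption, not from the thread).
    """
    scores = []
    for logits in logits_per_step:
        probs = softmax(logits)
        scores.append(sum(probs[i] for i in target_token_ids))
    return scores

def prob_any_mention(scores):
    """Chance of at least one mention, treating positions as independent.

    Independence is a simplifying assumption made for this sketch.
    """
    log_none = sum(math.log1p(-min(p, 1 - 1e-12)) for p in scores)
    return 1.0 - math.exp(log_none)

# Toy example: vocabulary of 4 tokens, token id 3 is the only target.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],          # uniform: target gets 0.25
    [0.0, 0.0, 0.0, math.log(3)],  # target gets 0.5
]
toy_scores = awareness_scores(toy_logits, {3})
toy_total = prob_any_mention(toy_scores)  # 1 - 0.75 * 0.5 = 0.625
```

Because the score is a probability read off a single forward pass rather than a count of sampled outputs that happen to contain the phrase, even a rarely voiced statement produces a nonzero signal at every position.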
