Latest Twitter Threads by @zstevenwu on Thread Reader App

Jun 10 • 8 tweets • 4 min read

Reusing a held-out set adaptively should invite overfitting. Yet in ML we reuse benchmarks for years and they stay informative. Why so little overfitting?

By using LLM agents as extreme compression engines, we get a new understanding of why. 🧵

Joint work w/ Martin Bertran and @Aaroth

A natural hypothesis: overfitting, i.e. memorizing the quirks of the test set, takes a lot to describe. A genuinely good strategy (an architecture, an optimizer, a schedule) is simple.

So if a result survives being squeezed into a few words, it is very unlikely to be overfitting. But how would you test this?
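One standard way to make this intuition quantitative is an Occam-style bound (Hoeffding's inequality plus a union bound). This is a textbook argument, not necessarily the analysis used in the work described here. The sketch below assumes the test examples are i.i.d., the score is an average of per-example values in [0, 1], and every candidate strategy can be written down in exactly `description_bits` bits; the function name and parameters are hypothetical.

```python
import math


def occam_gap_bound(description_bits: float, test_size: int, delta: float = 0.05) -> float:
    """Bound on |test score - true score| that holds at once for every
    strategy describable in `description_bits` bits, with probability
    at least 1 - delta over the draw of the test set.

    Assumptions (illustrative, not taken from the thread):
      - test examples are i.i.d. and per-example scores lie in [0, 1]
      - there are at most 2**description_bits candidate strategies
    """
    if test_size <= 0:
        raise ValueError("test_size must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    # Two-sided Hoeffding for one strategy: P(|gap| > eps) <= 2 * exp(-2 * n * eps**2).
    # Union bound over 2**k strategies, then solve for eps at failure probability delta.
    log_num_strategies = description_bits * math.log(2.0)
    return math.sqrt((log_num_strategies + math.log(2.0 / delta)) / (2.0 * test_size))


if __name__ == "__main__":
    n = 10_000  # hypothetical test-set size
    for bits in (100, 1_000, 100_000):
        print(f"{bits:>7} bits -> gap bound {occam_gap_bound(bits, n):.3f}")
```

Because the bound covers every short description simultaneously, it still applies when the strategy is picked adaptively after looking at test scores. It grows with the square root of the description length, so a result that fits in a few words has little room to overfit, while a long description gets a bound too loose to say anything.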
