Steven Wu
Computer science prof at Carnegie Mellon @SCSatCMU. Researcher in algorithms and machine learning. https://t.co/FzlcRoRSOo
Jun 10 8 tweets 4 min read
Reusing a held-out set adaptively should invite overfitting. Yet in ML we reuse benchmarks for years and they stay informative. Why so little overfitting?

By using LLM agents as extreme compression engines, we get new understanding of why. 🧵

Joint work w/ Martin Bertran and @Aaroth

A natural hypothesis: overfitting or memorizing the quirks of the test set takes a lot to describe. A genuinely good strategy (an architecture, an optimizer, a schedule) is simple.

So if a result survives being squeezed into a few words, it is very unlikely to be overfitting. But how would you test this?
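One way to make the "few words" intuition concrete is the textbook counting (Occam) bound. This is a rough sketch and not necessarily the argument used in this work. There are at most 2^k strategies describable in k bits, so Hoeffding's inequality plus a union bound limits how far any of them can drift from its true error on n i.i.d. test points, even if it was picked adaptively. The function name and the example numbers below are hypothetical.

```python
import math


def occam_gap_bound(description_bits: int, n_test: int, delta: float = 0.05) -> float:
    """Upper bound on |test error - true error| that holds, with probability
    at least 1 - delta, simultaneously for every strategy describable in
    exactly `description_bits` bits (two-sided Hoeffding + union bound).

    Assumes the test points are i.i.d. and the description language was
    fixed before looking at the test set.
    """
    if description_bits < 0 or n_test <= 0 or not 0 < delta < 1:
        raise ValueError("need description_bits >= 0, n_test > 0, 0 < delta < 1")
    # Union bound over 2^k strategies: 2^k * 2 * exp(-2 n eps^2) <= delta
    log_term = description_bits * math.log(2) + math.log(2 / delta)
    return math.sqrt(log_term / (2 * n_test))


# Hypothetical numbers: a 10,000-example test set.
short = occam_gap_bound(description_bits=200, n_test=10_000)    # a few words
long = occam_gap_bound(description_bits=10_000, n_test=10_000)  # room to memorize
print(f"200-bit strategy:    gap <= {short:.3f}")
print(f"10,000-bit strategy: gap <= {long:.3f}")
```

The bound grows with the square root of the description length, so a strategy that fits in a few hundred bits has little room to encode test-set quirks, while one with as many bits as test examples gets an essentially vacuous guarantee.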