AI researcher @Stanford. hypernetworks, energy-based models, genomics. Everettian. information geometer.
May 5 • 6 tweets • 3 min read
We developed a unified theory of generalization in deep learning. It explains grokking, double descent, benign overfitting, and implicit bias.
But theory is only half the story. It turns out that optimizing the population risk of any neural network comes down to a small modification of your optimizer. 🧵
Why do massive networks generalize on real data instead of just memorizing noise?
As a model trains, its output space dynamically splits in two. High-mobility directions capture the coherent signal. The vast remaining dimensions form a reservoir that safely traps residual noise, so it cannot hurt test predictions.
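The thread gives no equations, but the split between fast "signal" directions and a slow reservoir can be illustrated in the simplest setting where it provably occurs: gradient descent on linear regression. This is a toy sketch of the general spectral-bias phenomenon, not the authors' actual theory — the data construction (5 strong singular directions vs. 45 weak ones) and all parameter choices are my own assumptions. Along each right-singular direction of the data, gradient descent contracts the error by a factor of (1 − lr·σ²/n) per step, so large-σ directions are learned almost immediately while small-σ directions barely move:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50

# Build data with 5 strong singular directions (the "signal")
# and 45 weak ones (the "reservoir").
U = np.linalg.qr(rng.normal(size=(n, d)))[0]   # n x d, orthonormal columns
V = np.linalg.qr(rng.normal(size=(d, d)))[0]   # d x d, orthonormal
sigma = np.concatenate([np.full(5, 10.0), np.full(d - 5, 0.5)])
X = (U * sigma) @ V.T
w_true = rng.normal(size=d)
y = X @ w_true

# Plain full-batch gradient descent on the squared loss.
w = np.zeros(d)
lr = 1.0  # stable: lr * sigma_max^2 / n = 0.5 < 2
for _ in range(200):
    w -= lr * X.T @ (X @ w - y) / n

# Fraction of the true coefficient recovered in each singular direction:
# per-direction progress is exactly 1 - (1 - lr * sigma_i^2 / n)^T.
learned = np.abs(V.T @ w) / np.abs(V.T @ w_true)
print("signal directions:", learned[:5].mean())     # close to 1.0
print("reservoir directions:", learned[5:].mean())  # far below 1.0
```

After 200 steps the 5 strong directions are essentially fully learned, while the 45 weak directions have moved only a fraction of the way. In this picture, label noise living in the weak directions stays trapped there for the whole (finite) training run, which is the flavor of the "reservoir" claim above.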