AI researcher @Stanford. hypernetworks, energy-based models, genomics. Everettian. information geometer.
May 5 • 6 tweets • 3 min read
We developed a unified theory of generalization in deep learning. It explains grokking, double descent, benign overfitting, and implicit bias.
But theory is only half the story. It turns out that optimizing the population risk of any neural network comes down to a small modification of your optimizer. 🧵
Why do massive networks generalize on real data instead of just memorizing noise?
As a model trains, its output space dynamically splits in two. High-mobility directions capture the coherent signal. The vast remaining dimensions form a reservoir that safely traps residual noise, so it cannot hurt test predictions.
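The thread gives no equations, but the split between fast "signal" directions and a slow reservoir can be illustrated in the simplest setting where it provably occurs: gradient descent on linear regression. This is a toy sketch of the general spectral-bias phenomenon, not the authors' actual theory — the data construction (5 strong singular directions vs. 45 weak ones) and all parameter choices are my own assumptions. Along each right-singular direction of the data, gradient descent contracts the error by a factor of (1 − lr·σ²/n) per step, so large-σ directions are learned almost immediately while small-σ directions barely move:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50

# Build data with 5 strong singular directions (the "signal")
# and 45 weak ones (the "reservoir").
U = np.linalg.qr(rng.normal(size=(n, d)))[0]   # n x d, orthonormal columns
V = np.linalg.qr(rng.normal(size=(d, d)))[0]   # d x d, orthonormal
sigma = np.concatenate([np.full(5, 10.0), np.full(d - 5, 0.5)])
X = (U * sigma) @ V.T
w_true = rng.normal(size=d)
y = X @ w_true

# Plain full-batch gradient descent on the squared loss.
w = np.zeros(d)
lr = 1.0  # stable: lr * sigma_max^2 / n = 0.5 < 2
for _ in range(200):
    w -= lr * X.T @ (X @ w - y) / n

# Fraction of the true coefficient recovered in each singular direction:
# per-direction progress is exactly 1 - (1 - lr * sigma_i^2 / n)^T.
learned = np.abs(V.T @ w) / np.abs(V.T @ w_true)
print("signal directions:", learned[:5].mean())     # close to 1.0
print("reservoir directions:", learned[5:].mean())  # far below 1.0
```

After 200 steps the 5 strong directions are essentially fully learned, while the 45 weak directions have moved only a fraction of the way. In this picture, label noise living in the weak directions stays trapped there for the whole (finite) training run, which is the flavor of the "reservoir" claim above.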