Adam Pearce
@anthropicai, previously: google brain, @nytgraphics and @bbgvisualdata

Aug 7, 2023, 8 tweets

Do Machine Learning Models Memorize or Generalize?



An interactive introduction to grokking and mechanistic interpretability w/ @ghandeharioun, @nadamused_, @Nithum, @wattenberg and @iislucas: pair.withgoogle.com/explorables/gr…

We first look at a task where we know the generalizing solution: sparse parity. You can see the model generalizing as weight decay prunes spurious connections.
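A minimal sketch of the sparse-parity setup, assuming the usual formulation (the label is the parity of a few relevant bits and the rest are noise); the sizes, names, and hyperparameters below are illustrative, not the article's:

```python
# Hedged sketch of sparse parity: only the first few bits determine the label,
# and weight decay is what prunes connections to the irrelevant bits.
import torch
import torch.nn as nn

n_bits, n_relevant, n_samples = 30, 3, 1000             # illustrative sizes
X = torch.randint(0, 2, (n_samples, n_bits)).float()
y = X[:, :n_relevant].sum(dim=1) % 2                    # parity of the relevant bits only

model = nn.Sequential(nn.Linear(n_bits, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(20_000):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()                                          # decay keeps shrinking weights on the noise bits
```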

One point from @ZimingLiu11 that I hadn't internalized before training lots of models: grokking only happens when the hyperparameters are just right.

We can make other weird things happen too, like AdamW oscillating between low train loss and low weights.
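To make the hyperparameter sensitivity concrete, here's a hedged sketch of the kind of weight-decay sweep involved; the values are made up for illustration:

```python
# Illustrative sweep: too little weight decay and the model just memorizes,
# too much and it never fits the training set; grokking sits in a narrow band.
import torch
import torch.nn as nn

def make_model(n_bits=30, hidden=32):
    return nn.Sequential(nn.Linear(n_bits, hidden), nn.ReLU(), nn.Linear(hidden, 1))

for wd in [0.0, 0.01, 0.1, 1.0, 10.0]:
    model = make_model()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
    # ... train as in the sketch above, tracking train vs. test accuracy over many steps ...
```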

What about a more complex task?

To understand how an MLP solves modular addition, we train a much smaller model with a circular input embedding baked in.
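A guess at what "circular input embedding baked in" means concretely: each residue a mod p is fixed to the point (cos 2πa/p, sin 2πa/p) rather than learning the embedding. The modulus below is illustrative:

```python
import numpy as np

p = 67                                                      # illustrative modulus for a + b mod p
angles = 2 * np.pi * np.arange(p) / p
embed = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (p, 2): one point on the circle per residue

# the MLP then takes the concatenation of embed[a] and embed[b] as input
a, b = 5, 12
x = np.concatenate([embed[a], embed[b]])
```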

Following @NeelNanda5 and applying a discrete Fourier transform, we see larger models trained from scratch use the same star trick!
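A hedged sketch of that analysis: Fourier-transform the learned embedding matrix along the token axis and look at which frequencies carry the energy; in a grokked model only a handful dominate. The matrix below is a random placeholder standing in for trained weights:

```python
import numpy as np

p, d_model = 67, 128                              # illustrative sizes
W_embed = np.random.randn(p, d_model)             # stand-in for the trained embedding matrix (p tokens x d_model)

spectrum = np.abs(np.fft.fft(W_embed, axis=0))    # DFT over the token axis, per embedding dimension
energy_per_freq = (spectrum ** 2).sum(axis=1)     # total energy at each frequency
print(np.argsort(energy_per_freq)[::-1][:6])      # the dominant frequencies, i.e. the "stars"
```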

Finally, we show what the stars are doing and prove that they work.

Our ReLU activation has a small error, but it's close enough to the exact solution (an x² activation suggested by Andrey Gromov) for the model to patch everything up w/ constructive interference.
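The rough reasoning, stated as a sketch: with an x² activation the network can form products of the circular embeddings (via xy = ((x+y)² - (x-y)²)/4), which yields cos(ω(a+b)) terms, and the logit for a candidate answer c comes out proportional to Σ_k cos(ω_k(a+b-c)); every frequency peaks at c = a+b mod p, so the peaks add up there and roughly cancel elsewhere. A tiny numeric check with made-up frequencies:

```python
import numpy as np

p = 67                                       # illustrative prime modulus
freqs = np.array([3, 7, 11])                 # made-up "star" frequencies, not the trained model's

def logits(a, b):
    c = np.arange(p)
    # each frequency contributes cos(2*pi*f*(a+b-c)/p); the peaks all line up at c = (a+b) % p
    return np.cos(2 * np.pi * np.outer(freqs, a + b - c) / p).sum(axis=0)

a, b = 23, 58
print(int(np.argmax(logits(a, b))), (a + b) % p)   # both: 14
```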

@tweet_travior Also, different hyperparameters can make the improvement less sudden.

@strongnewera Probably not related in a deep way to the double descent phenomenon, but on the edge between memorization and generalization you can find models that undergo double descent during training.
