Do Machine Learning Models Memorize or Generalize?
An interactive introduction to grokking and mechanistic interpretability w/ @ghandeharioun, @nadamused_, @Nithum, @wattenberg and @iislucas https://t.co/ig9dp9GJBe pair.withgoogle.com/explorables/gr…
@ghandeharioun @nadamused_ @Nithum @wattenberg @iislucas We first look at a task where we know the generalizing solution: sparse parity. You can see the model generalizing as weight decay prunes spurious connections.
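Here's a minimal sketch of that setup (sizes and hyper-parameters are illustrative, not the ones from the explorable): a small MLP trained with AdamW weight decay on n-bit inputs whose label is the parity of the first k bits.

```python
import torch

# Sparse parity: the label depends only on k of the n input bits, so the
# generalizing solution is a small circuit that ignores the other bits.
torch.manual_seed(0)
n_bits, k, n_train = 30, 3, 1000          # illustrative sizes

X = torch.randint(0, 2, (n_train, n_bits)).float()
y = (X[:, :k].sum(dim=1) % 2).long()      # parity of the first k bits

model = torch.nn.Sequential(
    torch.nn.Linear(n_bits, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
# Weight decay is the ingredient that slowly prunes the memorizing,
# spurious connections to the irrelevant bits.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            X_te = torch.randint(0, 2, (1000, n_bits)).float()
            y_te = (X_te[:, :k].sum(dim=1) % 2).long()
            acc = (model(X_te).argmax(dim=1) == y_te).float().mean().item()
        print(f"step {step:6d}  train loss {loss.item():.4f}  test acc {acc:.2f}")
```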
One point from @ZimingLiu11 I hadn't internalized before training lots of models: grokking only happens when hyper-parameters are just right.
We can make other weird things happen too, like AdamW oscillating between low train loss and low weights.
@ZimingLiu11 What about a more complex task?
To understand how an MLP solves modular addition, we train a much smaller model with a circular input embedding baked in.
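The baked-in circular embedding is just each residue mapped to a point on a circle; a quick sketch (the frequency-1 circle and modulus here are my choices for illustration):

```python
import numpy as np

# Fixed (not learned) circular embedding for modular arithmetic:
# residue a -> (cos, sin) of the angle 2*pi*a/p on the unit circle.
p = 67                                   # illustrative modulus
a = np.arange(p)
embed = np.stack([np.cos(2 * np.pi * a / p),
                  np.sin(2 * np.pi * a / p)], axis=1)   # shape (p, 2)

# The MLP's input for a pair (a, b) is embed[a] concatenated with embed[b],
# so the model only has to learn to combine angles, not to discover the circle.
x = np.concatenate([embed[5], embed[12]])               # shape (4,)
```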
@ZimingLiu11 Following @NeelNanda5 and applying a discrete Fourier transform, we see larger models trained from scratch use the same star trick!
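A sketch of that check (W_embed below is a random stand-in for the learned embedding matrix, not real trained weights):

```python
import numpy as np

# Take a discrete Fourier transform of the embedding over the residue axis.
# In a grokked model, the spectrum concentrates on a handful of frequencies.
p, d_model = 113, 128                      # illustrative sizes
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(p, d_model))    # replace with the trained embedding

spectrum = np.abs(np.fft.rfft(W_embed, axis=0))   # shape (p//2 + 1, d_model)
power = (spectrum ** 2).sum(axis=1)
power[0] = 0.0                             # ignore the DC component
print("dominant frequencies:", np.argsort(power)[::-1][:5])
```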
@ZimingLiu11 @NeelNanda5 Finally, we show what the stars are doing and prove that they work.
Our ReLU activation has a small error, but it's close enough to the exact solution (an x² activation suggested by Andrey Gromov) for the model to patch everything up w/ constructive interference.
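Here's a numerical sketch of that interference argument using the exact x² construction, with random phases and frequencies chosen for illustration (these are not the trained model's weights): after expanding the square, the only phase-free term in the logit for class c is cos(w*(a+b-c)), so it adds up across neurons while every other term cancels.

```python
import numpy as np

p, n = 61, 2048                                   # modulus, number of "star" neurons
rng = np.random.default_rng(0)
freq = rng.integers(1, (p - 1) // 2 + 1, n)       # one frequency per neuron
alpha = rng.uniform(0, 2 * np.pi, n)              # random phase on the a-input
beta = rng.uniform(0, 2 * np.pi, n)               # random phase on the b-input
w = 2 * np.pi * freq / p

def logits(a, b):
    # hidden unit: (cos-embedded a + cos-embedded b), then the x^2 activation
    h = (np.cos(w * a + alpha) + np.cos(w * b + beta)) ** 2
    # output weights cos(w*c + alpha + beta): expanding h shows the only
    # phase-free product is cos(w*(a+b-c)); all the other terms keep a random
    # phase and destructively interfere when summed over neurons.
    c = np.arange(p)[:, None]
    return (np.cos(w * c + alpha + beta) * h).sum(axis=1)

a, b = 23, 51
print(logits(a, b).argmax(), (a + b) % p)         # both print 13
```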
@tweet_travior @ghandeharioun @nadamused_ @Nithum @wattenberg @iislucas Also, different hyper-parameters can make the improvement less sudden.
@strongnewera @ghandeharioun @nadamused_ @Nithum @wattenberg @iislucas Probably not related in a deep way to the double descent phenomenon, but on the edge between memorization and generalization you can find models that exhibit double descent during training.