PSA: Switch your optimizer to Shampoo!
We recently compared Shampoo against a tuned ensemble of Adam and SM3 at @HomebrewNLP and found that its hyperparameter search space contains many more "winning tickets," which also achieve lower losses!
To be precise, while SM3 trained 7 models (0.36%) to a loss below 1.46, Shampoo achieved that with 255 models (11.5%). Additionally, Shampoo's lowest loss is 3.5% lower, which is equivalent to training a 3x bigger model on 3x more data, according to Chinchilla's scaling laws.
Unfortunately, this convergence improvement does not come for free. Computing a Shampoo update incurs significant overhead, as it requires an inverse matrix root for every parameter matrix. Fortunately, the official implementation only recomputes these roots every few steps.
For brevity, ours does not:
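To make the overhead concrete, here is a minimal sketch (not the HomebrewNLP code, and all function names are made up for illustration) of a Shampoo-style step for a single 2D parameter, recomputing the inverse roots at every update:

```python
import jax.numpy as jnp


def inverse_pth_root(mat, p, eps=1e-6):
    """Compute mat^(-1/p) of a symmetric PSD matrix via an eigendecomposition."""
    w, v = jnp.linalg.eigh(mat + eps * jnp.eye(mat.shape[0]))
    w = jnp.maximum(w, eps)
    return (v * w ** (-1.0 / p)) @ v.T


def shampoo_step(param, grad, left_stat, right_stat, lr=1e-3):
    """One Shampoo-style update for a [rows, cols] parameter matrix.

    left_stat accumulates G @ G^T, right_stat accumulates G^T @ G, and the
    preconditioned gradient is L^(-1/4) @ G @ R^(-1/4).
    """
    left_stat = left_stat + grad @ grad.T
    right_stat = right_stat + grad.T @ grad
    precond_grad = inverse_pth_root(left_stat, 4) @ grad @ inverse_pth_root(right_stat, 4)
    return param - lr * precond_grad, left_stat, right_stat
```

The eigendecompositions are the expensive part, which is why the official implementation amortizes them over many steps.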
However, Shampoo trains faster than the baseline even when recomputing the inverse roots at every update. Additionally, increasing the batch size from 16 to 256 already reduces the overhead from 25% to 4.1%, so there's no need to worry.
Most importantly, Shampoo widens the range of "good" hyperparameters, so you have one less hyperparameter to worry about when starting a new project.
Looking at the plot below, it seems as if Shampoo accepts virtually any configuration and returns a great model.
If you'd like to try it out, you're in luck because there are various implementations!
Optax: github.com/google-researc…
Minimal (Jax): github.com/HomebrewNLP/Ho…
PyTorch: github.com/facebookresear…
All experiments shared above can be found in this WandB project: wandb.ai/homebrewnlp/gpt.
Lastly, I'd like to thank TensorFork and the TPU Research Cloud for funding this project, as the sweeps above used over 85,000 (preemptible) TPU-core hours. If you'd like to learn more about them, have a look at my previous thread:
Above, I only showed _that_ Shampoo works but didn't explain how it achieves these massive improvements.
Luckily, @_arohan_ wrote a detailed thread explaining the inner workings and related work:
In a paper review, @ykilcher also explained one of the critical components that make Shampoo work: Optimizer Grafting
I'd definitely recommend checking it out:
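For a rough picture of what grafting does (a sketch under my own naming, not the exact implementation): the Shampoo step provides the per-layer *direction*, while a simpler, well-understood optimizer such as Adagrad or SGD provides the per-layer *magnitude*:

```python
import jax.numpy as jnp


def graft(shampoo_step, grafted_step, eps=1e-12):
    """Rescale the Shampoo step so its norm matches the grafted optimizer's step."""
    shampoo_norm = jnp.linalg.norm(shampoo_step)
    grafted_norm = jnp.linalg.norm(grafted_step)
    return shampoo_step * (grafted_norm / (shampoo_norm + eps))
```

This lets Shampoo reuse the step sizes (and learning-rate schedules) that already work for the grafted optimizer.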