Tweet

Lucas Nestler

Jun 12 • 9 tweets • 6 min read

@HomebrewNLP

PSA: Switch your optimizer to Shampoo!

We recently tried Shampoo compared to a tuned ensemble of Adam and SM3 at @HomebrewNLP and found that the hyperparameter search space contains many more "winning tickets," which also achieve lower losses!

To be precise, while SM3 trained 7 (0.36%) models to a loss below 1.46, Shampoo achieved that with 255 (11.5%) models. Additionally, the lowest loss is 3.5% lower, which is equivalent to training a 3x bigger model with 3x more data, according to chinchilla's scaling laws.

Unfortunately, this convergence improvement does not come for free. Computing a Shampoo-Update incurs significant overheads as it must compute a matrix inverse for every parameter. Fortunately, the official implementation does this less frequently.
For brevity, ours does not:

However, shampoo trains faster than the baseline even when inverting the parameter matrix at every update. Additionally, increasing the batch size from 16 to 256 already reduces the overhead from 25% to 4.1%, so there's no need to worry.

Most importantly, shampoo increases the range of "good" hyperparameters. This way, you need to worry about one less hyperparameter when starting a new project.
Looking at the plot below, it seems as if shampoo accepts virtually any configuration and returns a great model.

If you'd like to try it out, you're in luck because there are various implementations!
Optax: github.com/google-researc…
Minimal (Jax): github.com/HomebrewNLP/Ho…
PyTorch: github.com/facebookresear…

All experiments shared above can be found in this WandB project: wandb.ai/homebrewnlp/gpt.

https://twitter.com/_clashluke/status/1506693368365174790

Lastly, I'd like to thank TensorFork and the TPU Research Cloud for funding this project, as the sweeps above used over 85000 (preemptible) TPU-core hours. If you'd like to learn more about them, have a look at my previous thread:

https://twitter.com/_clashluke/status/1506693368365174790

@_arohan_

Above, I only showed _that_ Shampoo works but didn't explain how it achieves these massive improvements.
Luckily, @_arohan_ wrote a detailed thread explaining the inner workings and related work:

https://twitter.com/_arohan_/status/1536115857025183745

@ykilcher

In a paper review, @ykilcher also explained one of the critical components that make Shampoo work: Optimizer Grafting
I'd definitely recommend checking it out:

https://twitter.com/ykilcher/status/1462086401067995139

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @_clashluke

Lucas Nestler

@_clashluke

Nov 29, 2021

"Sparse is Enough in Scaling Transformers", a recent paper by Sebastian Jaszczur from Google Research, shows 40x speedups at inference using structured sparsity without reducing downstream performance.

Abs: arxiv.org/abs/2111.12763
Code: github.com/google/trax/co…

1/22

Note that, above, the loss plot is not an official image from the paper. Instead, the authors published all of their runs on a public tensorboard: tensorboard.dev/experiment/on3….
This way, we can compare the results ourselves.

2/22

For example, it's a little suspicious how well their "sff64" model performs, considering that "sff32" and "sff128" both underperform the baseline significantly.
So let's try to understand what's going on.

3/22

Read 22 tweets

Lucas Nestler

@_clashluke

Nov 25, 2021

https://twitter.com/_clashluke/status/1463061191169822720

I want to retract this tweet publicly:

https://twitter.com/_clashluke/status/1463061191169822720

It is incorrect and causes unnecessary harm to the authors of "PoolFormer: MetaFormer is Actually What You Need for Vision" (arxiv.org/abs/2111.11418).
Using just AvgPool and MLP, they outperform most models.

1/6

@Buntworthy

First of all, as @Buntworthy pointed out here:

https://twitter.com/Buntworthy/status/1463905680004374535

They added a comparison with "ResNet strikes back" (arxiv.org/abs/2110.00476) on GitHub (github.com/sail-sg/poolfo…), showing how they outperform ResNet+ by training PoolFormer with DeiT's augmentations.

2/6

The most incredible part about all of this is that they effectively run
x - LayerNorm(x) + AvgPool(LayerNorm(x))
as a token mixing method, instead of expensive and difficult to scale convolutions or self-attention.

3/6

Read 6 tweets

Lucas Nestler

@_clashluke

Sep 20, 2021

https://twitter.com/ak92501/status/1439751096969334785

Primer combines L1-BN (arxiv.org/abs/1802.09769), Conformer (arxiv.org/abs/2005.08100) and "Squared ReLU" to reach up to 4x faster convergence at no additional memory cost.

https://twitter.com/ak92501/status/1439751096969334785

This speedup is almost as significant as Switch Transformer's (arxiv.org/abs/2101.03961). It got up to 7x speedups using 64x as many (sparse) parameters.
Primer, however, doesn't use more parameters. It's also orthogonal to Switch, so a combined 32x speedup seems plausible.

There's just one slight issue: The baseline.
Primer compares itself with a default transformer and has no ablations of individual changes.
Instead, they trained a standard 2B GPT3-XL for 2 trillion tokens, spending well over $1,000,000 on this one figure.

Read 7 tweets

Lucas Nestler

@_clashluke

Jul 27, 2021

https://twitter.com/ak92501/status/1419824931181846528

Finally, someone did it.
MoE + Weight sharing.
This is amazing.

WideNet finds a way to combine two time-parameter tradeoffs to reduce the final training time and parameter count.

https://twitter.com/ak92501/status/1419824931181846528

With fewer parameters, layers, and lower training time, they achieve a 3.2% (relative) lower top-1 error.
Their experiments also illustrate that ViT by itself can learn with weight sharing, which is incredibly exciting.

ALBERT (arxiv.org/abs/1909.11942) proposed the same thing for language models two years ago and found that adding weight sharing reduces parameter (and with that memory) consumption significantly but makes the model slower train.
Just like WideNet, they don't share LayerNorm

Read 7 tweets

Lucas Nestler

@_clashluke

Jul 15, 2021

@RiversHaveWings

I finally got around to playing with @RiversHaveWings's VQGAN+CLIP notebooks!
The first order of business was to try to reproduce @ak92501's beautiful samples. You can see the results of my journey below (seeds=0 and 123456)

1/5

https://twitter.com/ak92501/status/1414020174357934086

To reasonably create these samples, I attempted to optimize the model by jitting it with TorchScript. After countless wrong attempts, it's finally 5x as fast as the baseline. (If you're using PyTorch, try JIT. You might want to follow my notebook for further optimizations.)

2/5

I also added new features, such as gaussian dropout and noise, which immediately improved the samples.
Below you can see the same prompt with different sample-wide noise (S) and per-item noise (I).

1) S=0.05, I=0.01
2) S=0.25, I=0.10
3) S=0.10, I=0.153
4) S=0.25, I=0.125

3/5

Read 6 tweets

Lucas Nestler

@_clashluke

May 18, 2021

https://twitter.com/Hanxiao_6/status/1394742841033641985

This is major breakthrough 👇
We're now using only seq^2 (4Mi) elements for each attention tensor instead of batch*heads*seq^2 (128Gi) for a PanGu-Alpha-200B-sized model, without reducing the performance or ability to scale.

https://twitter.com/Hanxiao_6/status/1394742841033641985

@Hanxiao_6

I'll implement it immediately in our GPT codebase and share its performance on 2B-equivalent models.

@Hanxiao_6, is the split across channels necessary? You briefly describe it as "effective". Is that on TPU?

I can't figure out what "small initialization" means.
I finally arrived at 0.02 / context_size, which gives the blue curve (500M body + 400M embedding).
It looks very promising, but still NaNs after just 3000 steps with lr=1e-5.

Read 19 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Lucas Nestler

People who liked this thread also liked...

Try unrolling a thread yourself!

More from @_clashluke

Lucas Nestler

Lucas Nestler

Lucas Nestler

Lucas Nestler

Lucas Nestler

Lucas Nestler

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?