Lucas Nestler (@_clashluke)
May 18, 2021
This is a major breakthrough 👇
We're now using only seq^2 (4Mi) elements for each attention tensor instead of batch*heads*seq^2 (128Gi) for a PanGu-Alpha-200B-sized model, without reducing the performance or ability to scale.
I'll implement it immediately in our GPT codebase and share its performance on 2B-equivalent models.
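To make the memory argument concrete, here is a minimal PyTorch sketch of a gMLP-style spatial gating unit as I read the paper (a sketch, not the codebase's actual implementation): the token-mixing weight has shape seq×seq and is shared across the batch and every channel, so nothing of size batch*heads*seq^2 ever gets materialized.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Sketch of a gMLP-style spatial gating unit.

    The only seq^2-sized tensor is the learned token-mixing weight itself,
    shared across the batch and all channels -- unlike attention, which
    materializes a (batch, heads, seq, seq) score tensor.
    """

    def __init__(self, dim: int, seq_len: int, init_scale: float = 1e-3):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)          # dim is assumed even
        self.proj = nn.Linear(seq_len, seq_len)     # shared (seq, seq) mixer
        # "Small initialization": weight near zero, bias at one, so the unit
        # starts out close to an identity gate.
        nn.init.normal_(self.proj.weight, std=init_scale)
        nn.init.ones_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); split channels into gate and value halves.
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        # Mix along the sequence axis with the shared weight. A masked (GPT)
        # setup would additionally need a causal mask on self.proj.weight.
        v = self.proj(v.transpose(-1, -2)).transpose(-1, -2)
        return u * v                                 # (batch, seq, dim // 2)
```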

@Hanxiao_6, is the split across channels necessary? You briefly describe it as "effective". Is that on TPU?
I can't figure out what "small initialization" means.
I finally arrived at 0.02 / context_size, which gives the blue curve (500M body + 400M embedding).
It looks very promising, but it still NaNs after just 3000 steps with lr=1e-5.
The orange one is the baseline with 1.7B + 400M.
The red one is the same as the orange run but uses ReZero as well.
arxiv.org/abs/2003.04887
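For reference, ReZero (the paper linked above) simply scales each residual branch by a learned scalar that starts at zero, so every block is the identity at initialization. A minimal sketch:

```python
import torch
import torch.nn as nn

class ReZero(nn.Module):
    """Wrap any sub-layer (attention, MLP, mixer, ...) as x + alpha * f(x),
    with alpha initialized to zero (arxiv.org/abs/2003.04887)."""

    def __init__(self, body: nn.Module):
        super().__init__()
        self.body = body
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.body(x)
```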
Same problem (the run took 4 hrs), with normalization, activation, initialization, and everything else as described in the paper. The only three differences:
1) This uses ReZero
2) Single-Headed attention is not added
3) It's masked (GPT)
I'll start another run with MLP-Mixer.
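For readers who haven't seen it, an MLP-Mixer block replaces attention with an MLP applied across the token axis, followed by the usual MLP across the channel axis. A minimal PyTorch sketch adapted to 1-D token sequences (the paper targets image patches, and the causal masking a GPT-style setup would need is omitted here):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Sketch of an MLP-Mixer block for sequences of shape (batch, seq, dim)."""

    def __init__(self, dim: int, seq_len: int, expansion: int = 4):
        super().__init__()
        self.token_norm = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(              # mixes along the sequence
            nn.Linear(seq_len, seq_len * expansion), nn.GELU(),
            nn.Linear(seq_len * expansion, seq_len))
        self.channel_norm = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(            # mixes along the channels
            nn.Linear(dim, dim * expansion), nn.GELU(),
            nn.Linear(dim * expansion, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.token_norm(x).transpose(1, 2)       # (batch, dim, seq)
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.channel_norm(x))
        return x
```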
All runs clip the gradient norm at 0.3, compared to 1 above.
Pink is the same config as the light blue one above.
Red is the same as Pink, but with 8 MLP "heads".
Blue is Pink but with beta2=0.95

All runs have roughly 1B params.
As a comparison, here is MLP-Mixer compared to the red run above.
Green and light blue use 8 "heads".
Light blue and red have ReZero.
So far Mixer looks like it's much more stable and could benefit from less gradient clipping and a higher learning rate.
While gMLP turned out to be disappointing, I'm incredibly excited to share my first results of MLP-Mixer.
Orange - Baseline (2B)
DarkBlue - gMLP (1B)
LightBlue - MLP-Mixer (1B)
Green - MLP-Mixer (1B, 10x learning rate)

(The baseline diverged in <2k steps with a higher LR)
The runs above all have the same batch size, resulting in 1.3B tokens seen every 10k steps.
The models themselves are wide and shallow (4 blocks) to improve training efficiency during the initial test.
I'll launch two more extensive runs in a few minutes.
Even with a massive warmup, gMLP (orange) and the tanh-constrained gMLP (light blue) still NaN after a few thousand steps, at which point the learning rate had barely crossed 5e-5.
Unfortunately, the baseline attention (dark blue) is starting to look much better than MLP-Mixer (red).
All models have the same number of parameters (2.7B) and a depth of 16 attention-ff or ff-mixing blocks.
At this point, both models have barely seen 1.5B tokens, so the picture could still change.
The training runs OOMed after a few thousand steps, but here is the progress they made.
Light/Dark Blue - 2.7B, attention
Red/Orange - 6.5B/2.6B, MLP-Mixer
Light blue and red also have a 4x larger context (2048 instead of 512) than the other two runs.
I only just noticed that I cut off the x-axis in the step-wise plot.
Roughly, 1000 steps equate to 260 million tokens seen in all four training runs. At this point, the models have just crossed 1B and 2B tokens, respectively.
I'll report back when they're above 10B.
I'm excited to share a new set of runs. This time, without control flow issues.
* gMLP: blue run with initial plateau (811M)
* Transformer: pink (1.5B)
* Synthesizer: blue curve that follows pink (1.4B)
* CONTAINER: red (1.65B)
* MLP-Mixer: orange (677M)
All models have a constant learning rate of 1e-3, batch size of 4096, adaptive gradient clipping of 0.003 and context size of 128 characters.
With that, the models have all seen 2.5B to 3.2B characters (or 19B to 25B when normalized), which should make them very comparable.
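"Adaptive gradient clipping" here presumably refers to the AGC scheme popularized by the NFNet paper (Brock et al., 2021): each gradient is clipped relative to the norm of the parameter it belongs to rather than against one global threshold. A simplified per-tensor sketch (the actual runs may use the unit-wise variant):

```python
import torch

@torch.no_grad()
def adaptive_grad_clip_(params, clip: float = 0.003, eps: float = 1e-3):
    """Rescale each parameter's gradient if its norm exceeds `clip` times
    the norm of the parameter itself."""
    for p in params:
        if p.grad is None:
            continue
        p_norm = p.norm().clamp_min(eps)
        g_norm = p.grad.norm()
        max_norm = clip * p_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / g_norm.clamp_min(1e-6))
```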
Considering that MLP-Mixer has by far the best convergence (which is likely because of the significantly improved gradient flow) while running twice as fast and using the least memory, I'd recommend seriously looking into these models for any kind of task.
I ran some more tests with Mixer.
This time, SM3 vs Adam.
Both without custom hyperparameter tuning. (The Adam hyperparameters come straight from PanGu-Alpha, SM3's from the paper)
All runs have a batch size of 256 and a context of 2048; with that, 1000 steps = 524M chars.
Just looking at the curves, it becomes apparent how much more stable @_arohan_'s SM3 with Nesterov momentum is.
SM3 with momentum uses only half the memory of Adam and less compute, yet it is more stable, making it perfect for training and finetuning.
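For context, SM3 gets its memory savings by storing one second-moment accumulator per row and per column of a weight matrix instead of one per entry. A rough sketch of a single SM3-II update for a 2-D parameter (momentum, which the runs above add, is omitted):

```python
import torch

@torch.no_grad()
def sm3_step_(weight: torch.Tensor, grad: torch.Tensor,
              row_acc: torch.Tensor, col_acc: torch.Tensor,
              lr: float = 1e-3, eps: float = 1e-30):
    """One SM3-II step: O(m + n) accumulator memory instead of Adam's O(m * n)."""
    # Per-coordinate second-moment estimate: the tighter of the two covering
    # accumulators, plus the current squared gradient.
    nu = torch.minimum(row_acc.unsqueeze(1), col_acc.unsqueeze(0)) + grad ** 2
    weight.add_(grad / (nu.sqrt() + eps), alpha=-lr)
    # Each accumulator keeps the max over the coordinates it covers.
    row_acc.copy_(nu.max(dim=1).values)
    col_acc.copy_(nu.max(dim=0).values)
```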
Because of the reasons mentioned above, I'd love to see more usage of SM3.

If you want to try it out, check the code, or even reproduce the results, you can find everything here: github.com/tensorfork/OBS….

More from @_clashluke

Nov 20, 2022
Over the past weeks, I've worked on validating @ID_AA_Carmack's hypothesis on how to improve Adam's second-order approximation.
Resulting from that, I'd like to present TGAdam, an optimizer with up to 50% lower relative error:

1/11
2/11

Unlike AdamW, TGAdam performs well across a wide range of hyperparameters. Additionally, it can significantly outperform the baseline (MNIST+LR=0.1) with minimal tuning.
Below, you can see the aggregated results of over 6986 runs across architectures and datasets:
Large-scale tests on ImageNet or GPT are still outstanding, so take these results with a pile of salt.
However, these results don't come out of nowhere. In fact, TGAdamW is theoretically well-motivated.

3/11
Jun 25, 2022
Following a recent discussion sparked by @_arohan_ in this thread:
We tried Shampoo with a few more settings and compared it against AdamW as that's more common than SM3.

TL;DR: Shampoo is still better, but Shampoo#AdamW > AdamW
To go into a bit more detail:
The best pure Adam(W) outperforms the previous best (SM3#Shampoo) by 9.1%.
This is likely caused by our model's significant architectural changes as we switched from Attention to Bottleneck-Convolution+RNN. For Attention, SM3 might still be better.
Interestingly, looking at Adam vs. Adam#Shampoo, it'd appear that the previous benefits vanished entirely. The loss difference between these two dropped to 1.35%, compared to the previous 3.5% lower loss:
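(For readers puzzled by the "#" notation: as I read it, "A#B" denotes layer-wise optimizer grafting in the style of Agarwal et al.'s learning-rate grafting work, i.e. take B's update direction but rescale it to the step size A would have taken. A rough sketch:)

```python
import torch

@torch.no_grad()
def graft(magnitude_step: torch.Tensor, direction_step: torch.Tensor,
          eps: float = 1e-16) -> torch.Tensor:
    """Combine two per-layer updates: direction from one optimizer,
    norm (step size) from the other."""
    return direction_step * (magnitude_step.norm() /
                             direction_step.norm().clamp_min(eps))
```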
Jun 23, 2022
OpenAI just released a Video-GPT ("VPT") that "solved" Minecraft.
Below, we'll take apart their model to the point where we can start reproducing it.
If you're interested in training this on "the world," join our discord server: discord.gg/24WsKDsV6w
Let's start with their architectural description.
The core of their system has three parts:
1) "Data Cleaning": web-scale scraping and filtering
2) "IDM": a BERT-like model to generate data
3) "VPT": a GPT trained on Video Image
1) Data Cleaning
As with most web-scale datasets, some cleaning has to be done to ensure the model won't be trained on unethical inputs such as Minecraft Swastikas. Additionally, they decided to remove hard-to-learn inputs like Facecams and overlays to improve training efficiency.
Jun 12, 2022
PSA: Switch your optimizer to Shampoo!

We recently tried Shampoo compared to a tuned ensemble of Adam and SM3 at @HomebrewNLP and found that the hyperparameter search space contains many more "winning tickets," which also achieve lower losses!
To be precise, while SM3 trained 7 (0.36%) models to a loss below 1.46, Shampoo achieved that with 255 (11.5%) models. Additionally, the lowest loss is 3.5% lower, which is equivalent to training a 3x bigger model with 3x more data, according to Chinchilla's scaling laws.
Unfortunately, this convergence improvement does not come for free. Computing a Shampoo-Update incurs significant overheads as it must compute a matrix inverse for every parameter. Fortunately, the official implementation does this less frequently.
For brevity, ours does not:
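To make that overhead concrete, here is a rough, unamortized sketch of one Shampoo update for a matrix-shaped parameter. This is not the HomebrewNLP implementation; real code recomputes the inverse roots only every N steps and blocks large tensors:

```python
import torch

@torch.no_grad()
def shampoo_step_(weight, grad, left_pre, right_pre, lr=1e-3, eps=1e-6):
    """One (unamortized) Shampoo step for a 2-D parameter: accumulate two
    small preconditioners and apply their inverse fourth roots."""
    left_pre.add_(grad @ grad.T)     # (m, m) statistics
    right_pre.add_(grad.T @ grad)    # (n, n) statistics

    def inv_fourth_root(mat):
        # mat^(-1/4) via eigendecomposition; eps keeps it well conditioned.
        eye = torch.eye(mat.shape[0], dtype=mat.dtype, device=mat.device)
        vals, vecs = torch.linalg.eigh(mat + eps * eye)
        return vecs @ torch.diag(vals.clamp_min(eps) ** -0.25) @ vecs.T

    update = inv_fourth_root(left_pre) @ grad @ inv_fourth_root(right_pre)
    weight.add_(update, alpha=-lr)
```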
Nov 29, 2021
"Sparse is Enough in Scaling Transformers", a recent paper by Sebastian Jaszczur from Google Research, shows 40x speedups at inference using structured sparsity without reducing downstream performance.

Abs: arxiv.org/abs/2111.12763
Code: github.com/google/trax/co…

1/22
Note that, above, the loss plot is not an official image from the paper. Instead, the authors published all of their runs on a public tensorboard: tensorboard.dev/experiment/on3….
This way, we can compare the results ourselves.

2/22
For example, it's a little suspicious how well their "sff64" model performs, considering that "sff32" and "sff128" both underperform the baseline significantly.
So let's try to understand what's going on.

3/22
Nov 25, 2021
I want to retract this tweet publicly:

It is incorrect and causes unnecessary harm to the authors of "PoolFormer: MetaFormer is Actually What You Need for Vision" (arxiv.org/abs/2111.11418).
Using just AvgPool and MLP, they outperform most models.

1/6
First of all, as @Buntworthy pointed out here:
They added a comparison with "ResNet strikes back" (arxiv.org/abs/2110.00476) on GitHub (github.com/sail-sg/poolfo…), showing how they outperform ResNet+ by training PoolFormer with DeiT's augmentations.

2/6
The most incredible part about all of this is that they effectively run
x - LayerNorm(x) + AvgPool(LayerNorm(x))
as their token-mixing method (see the sketch below), instead of expensive and difficult-to-scale convolutions or self-attention.

3/6
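A minimal sketch of that token-mixing step for 1-D token sequences (the paper operates on 2-D feature maps with a 3x3 average pool; the pool size and layout here are simplified for illustration):

```python
import torch
import torch.nn as nn

class PoolTokenMixer(nn.Module):
    """Token mixing as x - LayerNorm(x) + AvgPool(LayerNorm(x))."""

    def __init__(self, dim: int, pool_size: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)                                      # (batch, seq, dim)
        pooled = self.pool(y.transpose(1, 2)).transpose(1, 2)
        return x - y + pooled
```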