Lucas Nestler
Sep 20, 2021
Primer combines L1-BN (arxiv.org/abs/1802.09769), Conformer (arxiv.org/abs/2005.08100) and "Squared ReLU" to reach up to 4x faster convergence at no additional memory cost.

This speedup is almost as significant as Switch Transformer's (arxiv.org/abs/2101.03961), which got up to 7x speedups using 64x as many (sparse) parameters.
Primer, however, doesn't use more parameters. It's also orthogonal to Switch, so a combined speedup of roughly 28x (4x times 7x) seems plausible.
There's just one slight issue: The baseline.
Primer compares itself with a default transformer and has no ablations of individual changes.
Instead, they trained a standard 2B GPT3-XL for 2 trillion tokens, spending well over $1,000,000 on this one figure.
For example, @lucidrains found that depthwise convolution helps, but not as much as token-shift. Similarly, SquaredReLU is worse than GEGLU or SquaredReLU-GLU, but Primer doesn't compare against either.
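To make the activations being compared concrete, here is a minimal sketch in plain Python. It assumes "SquaredReLU-GLU" means a gated linear unit whose gate uses Squared ReLU, in the same way GEGLU gates with GELU. The function names are illustrative, and in a real layer `value` and `gate` would be two separate linear projections of the same input.

```python
import math

def relu(x):
    return max(0.0, x)

def squared_relu(x):
    # Squared ReLU: relu(x) ** 2
    return relu(x) ** 2

def gelu(x):
    # Exact GELU via the Gaussian error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def glu_variant(value, gate, activation):
    # Gated unit: element-wise value * activation(gate)
    return [v * activation(g) for v, g in zip(value, gate)]

def geglu(value, gate):
    return glu_variant(value, gate, gelu)

def squared_relu_glu(value, gate):
    # Assumed meaning of "SquaredReLU-GLU": a GLU gated by Squared ReLU
    return glu_variant(value, gate, squared_relu)
```

The gated variants need a second projection, so a fair comparison would have to match parameter counts, for example by shrinking the hidden width.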

If you want to stay up-to-date, go join #EleutherAI: discord.gg/ybj3dQPs
The fact that Convolution + Attention (or Local + Global) is better than pure local or pure global was explored extensively in works like Nyströmformer (arxiv.org/abs/2102.03902), CvT (arxiv.org/abs/2103.15808) and Long-Short Transformer (arxiv.org/abs/2107.02192).
Similarly, L1-Normalization (arxiv.org/abs/1802.09769) showed higher stability and was verified independently. So, if anything, Primer indicates that these modifications might be here to stay.
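As a rough illustration of the L1 idea, the sketch below normalizes a batch of scalars by its mean absolute deviation instead of its standard deviation. The function name, the `eps` value and the sqrt(pi/2) rescaling (which makes the result match standard-deviation normalization for Gaussian inputs) are assumptions made for this example, not details taken from the thread.

```python
import math

def l1_normalize(batch, eps=1e-5):
    # Center the batch, then divide by the mean absolute deviation
    mean = sum(batch) / len(batch)
    mad = sum(abs(v - mean) for v in batch) / len(batch)
    # Assumption: rescale so the divisor estimates the standard deviation
    # when the inputs are Gaussian (std = sqrt(pi / 2) * mean abs deviation)
    scale = math.sqrt(math.pi / 2.0) * mad
    return [(v - mean) / (scale + eps) for v in batch]

normalized = l1_normalize([1.0, 2.0, 3.0, 4.0])
```

This avoids the squares and square root of a variance computation over the data, which is the usual argument for its efficiency and numerical robustness.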
However, nothing that they found is genuinely novel.
In fairness to Primer, they cite CvT as [43], but the difference is minuscule. CvT uses a regular convolution, while Primer "applies convolution for each head separately".
Separate convolutions can be implemented efficiently by simply adding groups to CvT's convolution.
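A small sketch of why groups are enough: with `groups` equal to the number of heads, one grouped convolution gives the same result as running a separate convolution on each head's channels. The `conv1d` helper, the shapes and the weights below are hypothetical and written in plain Python for illustration. They are not Primer's or CvT's actual code.

```python
def conv1d(x, weight, groups=1):
    # "Valid" 1D cross-correlation without padding.
    # x: [in_channels][length]
    # weight: [out_channels][in_channels // groups][kernel]
    in_per_group = len(x) // groups
    out_per_group = len(weight) // groups
    kernel = len(weight[0][0])
    length = len(x[0]) - kernel + 1
    out = []
    for o in range(len(weight)):
        # Each output channel only reads the input channels of its own group
        first_in = (o // out_per_group) * in_per_group
        row = []
        for t in range(length):
            acc = 0.0
            for i in range(in_per_group):
                for k in range(kernel):
                    acc += weight[o][i][k] * x[first_in + i][t + k]
            row.append(acc)
        out.append(row)
    return out

heads = 2  # two heads with two channels each
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0],
     [2.0, 0.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
w_head0 = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
w_head1 = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 1.0], [2.0, 0.0]]]

# One convolution per head, outputs concatenated along the channel axis
per_head = conv1d(x[:2], w_head0) + conv1d(x[2:], w_head1)
# A single grouped convolution over all channels
grouped = conv1d(x, w_head0 + w_head1, groups=heads)
```

Here `per_head` and `grouped` hold identical values, so the per-head variant is only a change to the `groups` argument of an existing convolution.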

More from @_clashluke

Nov 20, 2022
Over the past weeks, I've worked on validating @ID_AA_Carmack's hypothesis on how to improve Adam's second-order approximation.
Resulting from that, I'd like to present TGAdam, an optimizer with up to 50% lower relative error:

1/11
2/11

Unlike AdamW, TGAdam performs well across a wide range of hyperparameters. Additionally, it can significantly outperform the baseline (MNIST+LR=0.1) with minimal tuning.
Below, you can see the aggregated results of over 6986 runs across architectures and datasets:
Large-scale tests on ImageNet or GPT are still outstanding, so take these results with a pile of salt.
However, these results don't come from nowhere. In fact, TGAdamW is theoretically well-motivated.

3/11
Jun 25, 2022
Following a recent discussion sparked by @_arohan_ in this thread:
We tried Shampoo with a few more settings and compared it against AdamW as that's more common than SM3.

TL;DR: Shampoo still is better, but Shampoo#AdamW > AdamW
To go into a bit more detail:
The best pure Adam(W) outperforms the previous best (SM3#Shampoo) by 9.1%.
This is likely caused by our model's significant architectural changes as we switched from Attention to Bottleneck-Convolution+RNN. For Attention, SM3 might still be better.
Interestingly, looking at Adam vs. Adam#Shampoo, it'd appear that the previous benefits vanished entirely. The loss difference between these two dropped to 1.35% compared to the previous 3.5% lower loss:
Jun 23, 2022
OpenAI just released a Video-GPT ("VPT") that "solved" Minecraft.
Below, we'll take apart their model to the point where we can start reproducing it.
If you're interested in training this on "the world," join our discord server: discord.gg/24WsKDsV6w
Let's start with their architectural description.
The core of their system has three parts:
1) "Data Cleaning": web-scale scraping and filtering
2) "IDM": a BERT-like model to generate data
3) "VPT": a GPT trained on Video
1) Data Cleaning
As with most web-scale datasets, some cleaning has to be done to ensure the model won't be trained on unethical inputs such as Minecraft Swastikas. Additionally, they decided to remove hard-to-learn inputs like Facecams and overlays to improve training efficiency.
Jun 12, 2022
PSA: Switch your optimizer to Shampoo!

We recently tried Shampoo compared to a tuned ensemble of Adam and SM3 at @HomebrewNLP and found that the hyperparameter search space contains many more "winning tickets," which also achieve lower losses!
To be precise, while SM3 trained 7 (0.36%) models to a loss below 1.46, Shampoo achieved that with 255 (11.5%) models. Additionally, the lowest loss is 3.5% lower, which is equivalent to training a 3x bigger model with 3x more data, according to Chinchilla's scaling laws.
Unfortunately, this convergence improvement does not come for free. Computing a Shampoo-Update incurs significant overheads as it must compute a matrix inverse for every parameter. Fortunately, the official implementation does this less frequently.
For brevity, ours does not:
Nov 29, 2021
"Sparse is Enough in Scaling Transformers", a recent paper by Sebastian Jaszczur from Google Research, shows 40x speedups at inference using structured sparsity without reducing downstream performance.

Abs: arxiv.org/abs/2111.12763
Code: github.com/google/trax/co…

1/22
Note that, above, the loss plot is not an official image from the paper. Instead, the authors published all of their runs on a public tensorboard: tensorboard.dev/experiment/on3….
This way, we can compare the results ourselves.

2/22
For example, it's a little suspicious how well their "sff64" model performs, considering that "sff32" and "sff128" both underperform the baseline significantly.
So let's try to understand what's going on.

3/22
Nov 25, 2021
I want to retract this tweet publicly:

It is incorrect and causes unnecessary harm to the authors of "PoolFormer: MetaFormer is Actually What You Need for Vision" (arxiv.org/abs/2111.11418).
Using just AvgPool and MLP, they outperform most models.

1/6
First of all, as @Buntworthy pointed out here:
They added a comparison with "ResNet strikes back" (arxiv.org/abs/2110.00476) on GitHub (github.com/sail-sg/poolfo…), showing how they outperform ResNet+ by training PoolFormer with DeiT's augmentations.

2/6
The most incredible part about all of this is that they effectively run
x - LayerNorm(x) + AvgPool(LayerNorm(x))
as a token mixing method, instead of expensive and difficult to scale convolutions or self-attention.
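The expression above can be sketched in plain Python. This is a simplified 1D version over a sequence of tokens (the paper works on 2D image feature maps), and the window size of 3, the edge handling and all function names are assumptions made for illustration.

```python
import math

def layer_norm(token, eps=1e-5):
    # Normalize one token's features to zero mean and unit variance
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def avg_pool(tokens, window=3):
    # Stride-1 average over neighbouring tokens; edges average fewer tokens
    half = window // 2
    pooled = []
    for t in range(len(tokens)):
        neighbours = tokens[max(0, t - half): t + half + 1]
        pooled.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return pooled

def pool_mixer(tokens):
    # x - LayerNorm(x) + AvgPool(LayerNorm(x)), applied per token
    normed = [layer_norm(t) for t in tokens]
    pooled = avg_pool(normed)
    return [[x - n + p for x, n, p in zip(xt, nt, pt)]
            for xt, nt, pt in zip(tokens, normed, pooled)]

mixed = pool_mixer([[1.0, 2.0, 3.0]] * 4)
```

When every token is identical, pooling changes nothing and the mixer returns its input, which shows that the operation only moves information between tokens that differ. It has no learned parameters in this form.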

3/6
