This speedup is almost as significant as Switch Transformer's (arxiv.org/abs/2101.03961). Switch got up to 7x speedups by using 64x as many (sparse) parameters.
Primer, however, doesn't use more parameters. It's also orthogonal to Switch, so a combined 32x speedup seems plausible.
There's just one slight issue: The baseline.
Primer compares itself with a default transformer and has no ablations of individual changes.
Instead, they trained a standard 2B GPT3-XL for 2 trillion tokens, spending well over $1,000,000 on this one figure.
For example, @lucidrains found that depthwise convolution helps, but not as much as token-shift. Similarly, SquaredReLU is worse than GEGLU or SquaredReLU-GLU, but Primer doesn't compare against either.
So, if anything, Primer indicates that these modifications might be here to stay.
However, nothing that they found is genuinely novel.
In fairness to Primer, they cite CvT as [43], but the difference is minuscule. CvT uses a regular convolution, while Primer "applies convolution for each head separately".
Separate convolutions can be implemented efficiently by simply adding groups to CvT's convolution.
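To illustrate (a minimal sketch with made-up shapes, not Primer's or CvT's actual code): in PyTorch, the per-head variant is just CvT's `nn.Conv1d` with `groups=num_heads`, which also cuts the weight count by a factor of `num_heads`.

```python
import torch
import torch.nn as nn

d_model, num_heads, kernel = 512, 8, 3

# CvT-style: one convolution mixing all channels
cvt_conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2)

# Primer-style: a separate convolution per head, via the groups argument
primer_conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2,
                        groups=num_heads)

x = torch.randn(2, d_model, 128)  # (batch, channels, sequence)
print(cvt_conv(x).shape, primer_conv(x).shape)  # same output shape
```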
With fewer parameters, layers, and lower training time, they achieve a 3.2% (relative) lower top-1 error.
Their experiments also illustrate that ViT by itself can learn with weight sharing, which is incredibly exciting.
ALBERT (arxiv.org/abs/1909.11942) proposed the same thing for language models two years ago and found that adding weight sharing reduces parameter (and with that memory) consumption significantly but makes the model slower to train.
Just like WideNet, they don't share LayerNorm.
I finally got around to playing with @RiversHaveWings's VQGAN+CLIP notebooks!
The first order of business was to try to reproduce @ak92501's beautiful samples. You can see the results of my journey below (seeds=0 and 123456)
To reasonably create these samples, I attempted to optimize the model by jitting it with TorchScript. After countless wrong attempts, it's finally 5x as fast as the baseline. (If you're using PyTorch, try JIT. You might want to follow my notebook for further optimizations.)
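Not the notebook's actual code, but the core move is this simple (a sketch of TorchScript compilation on a toy module):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

block = Block(64)
scripted = torch.jit.script(block)  # compiles forward() to TorchScript

x = torch.randn(8, 64)
assert torch.allclose(block(x), scripted(x))  # same outputs, faster execution after warmup
```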
2/5
I also added new features, such as Gaussian dropout and noise, which immediately improved the samples.
Below you can see the same prompt with different sample-wide noise (S) and per-item noise (I).
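A sketch of what I mean by the two noise types (my own formulation, not library code): sample-wide noise draws one value per sample and broadcasts it, while per-item noise draws independently for every element.

```python
import torch

def gaussian_dropout(x: torch.Tensor, std: float = 0.1) -> torch.Tensor:
    # Multiplicative Gaussian noise centered at 1
    return x * (1 + torch.randn_like(x) * std)

def gaussian_noise(x: torch.Tensor, sample_std: float, item_std: float) -> torch.Tensor:
    # Sample-wide (S): one draw per sample, broadcast over all its elements
    s = torch.randn(x.shape[0], *([1] * (x.ndim - 1))) * sample_std
    # Per-item (I): an independent draw for every element
    i = torch.randn_like(x) * item_std
    return x + s + i

x = torch.zeros(4, 3, 64, 64)
y = gaussian_noise(x, sample_std=0.1, item_std=0.05)
```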
This is a major breakthrough 👇
We're now using only seq^2 (4Mi) elements for each attention tensor instead of batch*heads*seq^2 (128Gi) for a PanGu-Alpha-200B-sized model, without reducing the performance or ability to scale.
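The arithmetic behind those numbers, assuming PanGu-Alpha-200B-like settings (my guesses: seq=2048, 128 heads, batch 256 — any split with batch*heads = 32768 gives the same totals):

```python
seq, heads, batch = 2048, 128, 256  # assumed model/training settings

per_tensor = seq ** 2            # elements per attention tensor with the trick
full = batch * heads * seq ** 2  # elements without it

print(per_tensor == 4 * 2 ** 20)   # 4Mi
print(full == 128 * 2 ** 30)       # 128Gi
print(full // per_tensor)          # 32768x reduction
```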
I'll implement it immediately in our GPT codebase and share its performance on 2B-equivalent models.
@Hanxiao_6, is the split across channels necessary? You briefly describe it as "effective". Is that on TPU?
I can't figure out what "small initialization" means.
I finally arrived at 0.02 / context_size, which gives the blue curve (500M body + 400M embedding).
It looks very promising, but still NaNs after just 3000 steps with lr=1e-5.
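For reference, the initialization I landed on looks roughly like this (a sketch; the context size and which modules get the init are my assumptions):

```python
import torch
import torch.nn as nn

context_size = 2048  # assumed sequence length

def small_init_(module: nn.Module) -> None:
    # Scale the usual 0.02 std down by the context size
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02 / context_size)

layer = nn.Linear(512, 512)
layer.apply(small_init_)
print(layer.weight.std())  # roughly 0.02 / 2048 ≈ 1e-5
```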