Finally, someone did it.
MoE + Weight sharing.
This is amazing.

WideNet combines two time-parameter tradeoffs, MoE and weight sharing, to reduce both final training time and parameter count.
With fewer parameters, fewer layers, and less training time, they achieve a 3.2% (relative) reduction in top-1 error.
Their experiments also illustrate that ViT by itself can learn with weight sharing, which is incredibly exciting.
ALBERT (arxiv.org/abs/1909.11942) proposed the same thing for language models two years ago. It found that weight sharing significantly reduces parameter (and with that memory) consumption, but makes the model slower to train.
Just like WideNet, ALBERT does not share LayerNorm parameters across layers.
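The ALBERT/WideNet-style sharing is easy to sketch: one set of block weights reused at every depth, with a private LayerNorm per block. A minimal NumPy sketch (all shapes, the single-matrix "FFN", and the init values are illustrative assumptions, not either paper's actual configuration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension, then apply a per-block affine.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Hypothetical sizes for illustration only.
dim, depth, tokens = 8, 4, 3
rng = np.random.default_rng(0)

# One shared weight matrix reused by every block...
w_shared = rng.normal(0, 0.02, (dim, dim))
# ...but private LayerNorm parameters (gamma, beta) per block: NOT shared.
norms = [(np.ones(dim), np.zeros(dim)) for _ in range(depth)]

x = rng.normal(size=(tokens, dim))
for gamma, beta in norms:  # `depth` iterations over the SAME weights
    x = x + np.maximum(layer_norm(x, gamma, beta) @ w_shared, 0.0)

# Parameter count is independent of depth, up to the cheap LayerNorms:
print(w_shared.size + sum(g.size + b.size for g, b in norms))  # 64 + 4*16 = 128
```

The point of the sketch: going deeper only adds `2 * dim` LayerNorm parameters per block, while the expensive matrices stay constant in number.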
WideNet extends this investigation by checking whether adding MoE helps, and if so, by how much.
The unexpected result is that WideNet-L performs better with parameter sharing. This could be because each expert sees cleaner, stronger gradients.
To validate this hypothesis, they "group" the experts: reusing the same gating decision for 6 consecutive blocks (2 groups) overfits much more than computing a new gating at every block. This indicates that the per-block gating decisions are what give the model its additional performance.
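Per-block gating can be sketched like this: the router and expert weights are shared across depth (as in WideNet), but each block recomputes the softmax gate on the current token representations, so a token may switch experts as its representation changes. The top-1 routing and all shapes here are simplifying assumptions for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
dim, n_experts, tokens, depth = 8, 4, 5, 6

# Shared across all blocks, as in WideNet: expert weights and the router.
experts = rng.normal(0, 0.02, (n_experts, dim, dim))
router = rng.normal(0, 0.02, (dim, n_experts))

x = rng.normal(size=(tokens, dim))
assignments = []
for _ in range(depth):
    gate = softmax(x @ router)   # fresh gating from the CURRENT activations
    top1 = gate.argmax(-1)       # top-1 expert per token (Switch-style)
    assignments.append(top1)
    # Each token passes through its chosen expert, scaled by its gate value.
    out = np.einsum('td,tde->te', x, experts[top1])
    x = x + out * gate[np.arange(tokens), top1][:, None]
# "Grouped" gating would instead reuse one `top1` for several consecutive
# blocks; WideNet found that this overfits more than re-gating every block.
```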
They also varied the number of experts in the MoE layers and found results similar to MAT: adding experts (while keeping all other parameters the same) increases overfitting and reduces evaluation performance.
Perhaps we have to use Attention-MoE?
TL;DR: WideNet illustrates very well that MoE has both gradient and overfitting issues, and that both can be mitigated by weight sharing.
Considering its loss curve, I'm unsure whether Switch Transformer has these problems in NLP at 1M tokens/step, since WideNet uses a batch size of only 4096 samples.

• • •

Thread by Lucas Nestler (@_clashluke)
