With fewer parameters, layers, and lower training time, they achieve a 3.2% (relative) lower top-1 error.
Their experiments also illustrate that ViT by itself can learn with weight sharing, which is incredibly exciting.
ALBERT (arxiv.org/abs/1909.11942) proposed the same thing for language models two years ago and found that adding weight sharing reduces parameter (and with that memory) consumption significantly but makes the model slower to train.
Just like WideNet, they don't share LayerNorm.
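To make the parameter savings concrete, here is a rough sketch of the arithmetic for a plain transformer stack where every block reuses the same attention and feed-forward weights but keeps its own LayerNorm. The shapes (d_model=768, d_ff=3072, 12 blocks) and all function names are my own assumptions for illustration, not numbers from either paper, and the count ignores embeddings and any MoE experts.

```python
from dataclasses import dataclass


@dataclass
class BlockShape:
    d_model: int  # hidden width (hypothetical value below)
    d_ff: int     # feed-forward width (hypothetical value below)


def block_weight_params(s: BlockShape) -> int:
    """Attention + feed-forward parameters of one block (weights and biases)."""
    attn = 4 * (s.d_model * s.d_model + s.d_model)   # Q, K, V and output projections
    ffn = (s.d_model * s.d_ff + s.d_ff) + (s.d_ff * s.d_model + s.d_model)
    return attn + ffn


def layernorm_params(s: BlockShape) -> int:
    """Two LayerNorms per block, each with a scale and a bias vector."""
    return 2 * 2 * s.d_model


def total_params(s: BlockShape, depth: int, share_weights: bool) -> int:
    """Share the heavy weights across blocks, but never the LayerNorms."""
    copies = 1 if share_weights else depth
    return copies * block_weight_params(s) + depth * layernorm_params(s)


shape = BlockShape(d_model=768, d_ff=3072)  # assumed, ViT-B-like sizes
unshared = total_params(shape, depth=12, share_weights=False)
shared = total_params(shape, depth=12, share_weights=True)
print(f"unshared: {unshared:,}  shared: {shared:,}  ratio: {unshared / shared:.1f}x")
```

The per-block LayerNorms are tiny next to the shared weights, so keeping them separate costs almost nothing in this count.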
WideNet investigated the same thing by checking whether MoE helps, and if so, how much.
The unexpected thing here is that WideNet-L performs better with parameter sharing. This could be because of the cleaner and stronger gradients for each expert.
To validate this hypothesis, they "grouped" the experts and found that gating to the same tokens for 6 blocks (2 groups) overfits much more than calculating a new gating at every block, indicating that these additional gatings are what give this model its additional performance.
They also tested varying the number of experts in MoE and found similar things to MAT.
Adding experts (while keeping all other parameters the same) increases overfitting and reduces the evaluation performance.
Perhaps we have to use Attention-MoE?
TL;DR: WideNet illustrates very well that MoE has both gradient and overfitting issues which can be improved by adding weight sharing.
Considering its curve, I'm unsure whether Switch, and with it NLP, has these problems at 1M tokens/step, as WideNet uses a batch size of 4096 samples.
I finally got around to playing with @RiversHaveWings's VQGAN+CLIP notebooks!
The first order of business was to try to reproduce @ak92501's beautiful samples. You can see the results of my journey below (seeds=0 and 123456)
To reasonably create these samples, I attempted to optimize the model by jitting it with TorchScript. After countless wrong attempts, it's finally 5x as fast as the baseline. (If you're using PyTorch, try JIT. You might want to follow my notebook for further optimizations.)
2/5
I also added new features, such as gaussian dropout and noise, which immediately improved the samples.
Below you can see the same prompt with different sample-wide noise (S) and per-item noise (I).
This is a major breakthrough 👇
We're now using only seq^2 (4Mi) elements for each attention tensor instead of batch*heads*seq^2 (128Gi) for a PanGu-Alpha-200B-sized model, without reducing the performance or ability to scale.
I'll implement it immediately in our GPT codebase and share its performance on 2B-equivalent models.
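For a sense of scale, here is the arithmetic behind those two sizes as a quick sketch. The sequence length and the batch*heads product are not stated in the thread; they are simply what falls out of 4Mi and 128Gi. The 2 bytes per element (fp16) is my assumption.

```python
import math

MI = 2 ** 20
GI = 2 ** 30

shared_elems = 4 * MI    # seq^2, as quoted in the tweet
full_elems = 128 * GI    # batch * heads * seq^2, as quoted in the tweet

# Implied by the quoted sizes, not stated in the thread
seq = math.isqrt(shared_elems)
batch_times_heads = full_elems // shared_elems

BYTES_PER_ELEM = 2       # assumption: fp16 storage
shared_mib = shared_elems * BYTES_PER_ELEM / MI
full_gib = full_elems * BYTES_PER_ELEM / GI

print(f"seq = {seq}, batch*heads = {batch_times_heads}")
print(f"shared tensor: {shared_mib:.0f} MiB, full tensor: {full_gib:.0f} GiB")
```

So the attention tensor shrinks by exactly the batch*heads factor, since it no longer depends on either dimension.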
@Hanxiao_6, is the split across channels necessary? You briefly describe it as "effective". Is that on TPU?
I can't figure out what "small initialization" means.
I finally arrived at 0.02 / context_size, which gives the blue curve (500M body + 400M embedding).
It looks very promising, but still NaNs after just 3000 steps with lr=1e-5.
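Roughly what I mean by 0.02 / context_size, as a stdlib-only sketch. Reading that value as the standard deviation of a zero-mean normal init is an assumption on my side, and context_size=2048 plus the helper names are made up for the example.

```python
import random
import statistics


def init_std(context_size: int, base: float = 0.02) -> float:
    """Scale the usual 0.02 init down by the context size (assumed reading)."""
    return base / context_size


def init_weights(n: int, context_size: int, seed: int = 0) -> list:
    """Draw n weights from N(0, init_std(context_size)^2)."""
    rng = random.Random(seed)
    std = init_std(context_size)
    return [rng.gauss(0.0, std) for _ in range(n)]


context_size = 2048  # hypothetical value, not given in the thread
weights = init_weights(10_000, context_size)
print(f"target std: {init_std(context_size):.3e}")
print(f"sample std: {statistics.pstdev(weights):.3e}")
```

With a context of 2048 this lands around 1e-5, so the weights start about three orders of magnitude smaller than with the usual 0.02.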