Boris Dayma 🖍️ · Apr 21, 2022 · 14-tweet thread
I've been comparing a lot of transformer variants on large models (400M params): Post/Pre-LN, DeepNet, NormFormers, Swin v2, GLU variants, RMSNorm, Sandwich LN, with GELU, Swish, SmeLU…

More than 2,000 hours of total training time on TPU v3s 😯

Here are my findings 🤓
You must use a final LayerNorm in the decoder for any pre-LN architecture (post-LN architectures always have one).

Using one at the end of the encoder also helps convergence.
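For illustration, a minimal Flax sketch of a pre-LN stack with the trailing LayerNorm; the block uses a single dense layer as a stand-in for attention/FFN, and all names and sizes are illustrative, not the dalle-mini code:

```python
import flax.linen as nn


class PreLNBlock(nn.Module):
    dim: int

    @nn.compact
    def __call__(self, x):
        h = nn.LayerNorm()(x)        # pre-LN: normalize the sub-layer input
        h = nn.Dense(self.dim)(h)    # stand-in for attention / FFN
        return x + h                 # the residual stream stays un-normalized


class PreLNStack(nn.Module):
    dim: int = 512
    num_layers: int = 12

    @nn.compact
    def __call__(self, x):
        for _ in range(self.num_layers):
            x = PreLNBlock(self.dim)(x)
        return nn.LayerNorm()(x)     # the final LayerNorm discussed above
```

Because nothing inside the blocks ever normalizes the residual stream, the single LayerNorm at the end is what keeps the output scale under control.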
Don't use biases in dense layers.
They add 15% to training time and hurt convergence.
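In Flax this is a single flag on each dense layer (the layer size here is just a placeholder):

```python
import flax.linen as nn

# Dense layer without the additive bias term
dense = nn.Dense(features=2048, use_bias=False)
```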
If you use a post-LN model (LayerNorms after each residual connection), DeepNet improves stability a little.
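As a rough sketch of the DeepNet idea (not the full recipe, which also rescales sub-layer weights at init), the post-LN residual gains a depth-dependent constant:

```python
import flax.linen as nn


class DeepNormResidual(nn.Module):
    # Depth-dependent constant from the DeepNet paper,
    # e.g. (2 * num_layers) ** 0.25 for an encoder-only stack.
    alpha: float

    @nn.compact
    def __call__(self, x, sublayer_out):
        # Post-LN with the residual branch up-weighted by alpha
        return nn.LayerNorm()(self.alpha * x + sublayer_out)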
I had the most success with NormFormers.
The position of the LayerNorms is similar to Sandwich-LN (per CogView), except in the attention block.
However, it shows better stability.
I ran a bunch of NormFormer variants.
The paper suggests not using the scaled residual connection. I recommend not using the head scale either.
I don't use a learnt scale in LayerNorms that are followed by dense layers. Training is a bit better with it, but I think that's only because it acts as a reduced learning rate.
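A hedged sketch of a block configured as above: pre-LN plus an extra LayerNorm on the attention output and after the FFN activation, no head scale, no residual scaling, and no learnt LN scale where the LN feeds a dense projection. Names and sizes are illustrative, not the actual dalle-mini modules:

```python
import flax.linen as nn


class NormFormerishBlock(nn.Module):
    dim: int = 512
    mlp_dim: int = 2048
    num_heads: int = 8

    @nn.compact
    def __call__(self, x):
        # Attention sub-block
        h = nn.LayerNorm(use_scale=False)(x)     # no learnt scale: feeds a dense projection
        h = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(h, h)
        h = nn.LayerNorm()(h)                    # extra post-attention LayerNorm
        x = x + h                                # plain residual: no scaling, no head scale

        # FFN sub-block
        h = nn.LayerNorm(use_scale=False)(x)
        h = nn.Dense(self.mlp_dim, use_bias=False)(h)
        h = nn.gelu(h)
        h = nn.LayerNorm(use_scale=False)(h)     # extra LayerNorm after the activation
        h = nn.Dense(self.dim, use_bias=False)(h)
        return x + h
```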
GLU variants are great!

Even though they increase peak memory (and reduce your max batch size) for the same number of total parameters, they let the model train much better!
As the activation function, use GELU (more stable) or Swish (trains faster).
I didn't have great results with the new SmeLU function.
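A minimal sketch of a GLU-variant feed-forward block (GEGLU as written; swap nn.gelu for nn.swish to get SwiGLU). Sizes are illustrative placeholders:

```python
import flax.linen as nn


class GLUFeedForward(nn.Module):
    dim: int = 512
    hidden: int = 2048

    @nn.compact
    def __call__(self, x):
        gate = nn.gelu(nn.Dense(self.hidden, use_bias=False)(x))  # nn.swish here -> SwiGLU
        value = nn.Dense(self.hidden, use_bias=False)(x)
        # Two hidden projections instead of one is what raises peak activation
        # memory at a matched parameter count.
        return nn.Dense(self.dim, use_bias=False)(gate * value)
```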
RMSNorm brings more stability to training.

However, on very long runs it plateaus before LayerNorm does.
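For reference, a minimal RMSNorm sketch: normalize by the root mean square of the features only, with no mean subtraction and no bias (the epsilon value is illustrative):

```python
import jax.numpy as jnp
import flax.linen as nn


class RMSNorm(nn.Module):
    eps: float = 1e-6

    @nn.compact
    def __call__(self, x):
        scale = self.param("scale", nn.initializers.ones, (x.shape[-1],))
        rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + self.eps)
        return x / rms * scale
```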
I didn't have great results with Swin v2, even after playing with different values of the tau scale in the cosine attention.
I think the cosine attention makes the model learn much slower.
I also tried using only the Swin relative positions with other variants, but it was not helpful.
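The cosine attention in question replaces scaled dot-product logits with an L2-normalized similarity divided by a learnable temperature tau (clamped from below in Swin v2). A rough sketch, with shapes and the epsilon chosen for illustration:

```python
import jax.numpy as jnp


def cosine_attention_logits(q, k, tau):
    """q, k: [..., seq, head_dim]; tau: learnable scalar temperature."""
    q = q / (jnp.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    k = k / (jnp.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    tau = jnp.maximum(tau, 0.01)  # Swin v2 clamps the temperature from below
    return jnp.einsum("...qd,...kd->...qk", q, k) / tau
```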
Sinkformers are a bit slower to train and didn't improve the model.
Maybe that's because I can only use them in the encoder, not in the decoder, due to causality.
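For context, Sinkformers replace the attention softmax with a few Sinkhorn iterations so the attention matrix becomes approximately doubly stochastic; the column normalization mixes information across query positions, which is what clashes with causal masking in a decoder. A rough sketch (the iteration count is illustrative):

```python
import jax.numpy as jnp


def sinkhorn_attention(logits, n_iters=3):
    """logits: [..., q_len, k_len]; returns approximately doubly-stochastic weights."""
    a = jnp.exp(logits - jnp.max(logits, axis=(-2, -1), keepdims=True))
    for _ in range(n_iters):
        a = a / jnp.sum(a, axis=-2, keepdims=True)  # normalize columns (over queries)
        a = a / jnp.sum(a, axis=-1, keepdims=True)  # normalize rows (over keys)
    return a
```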
You'll find more details in the report, along with a TL;DR linking to the relevant sections, interactive graphs, and traceable runs (with diffs between runs).

Report here: wandb.ai/dalle-mini/dal…
Many thanks to @_arohan_ and Phil Wang for suggesting some of these ideas!
I have a lot more things to try in my backlog, so I'll probably be updating this report in the future.

Also thanks to @pcuenq for running some of these experiments with me!

More from @borisdayma

Jun 20, 2024
Preliminary settings for the large CapPa model:
- Vision model: 328M params
- Text model: 348M params (67M embeds)

Going to train on TPU v5e-256 from TRC 😎

Model based on "Image Captioners Are Scalable Vision Learners Too" with a few tweaks.
Vision models are most often trained either in a contrastive fashion on a noisy dataset (CLIP) or as a classifier on ImageNet.

Here we train a captioner on a noisy dataset.
The goal is to create a strong vision model (we discard the text model) to be used for any downstream task.
The paper proposes 2 training methods:
- Cap -> train on captioning only
- CapPa -> also adds a masked objective where part or all of the text is masked

The model trained will be CapPa with full masking 75% of the time 🤯 (see the sketch below)

See CapPa paper: arxiv.org/abs/2306.07915
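A hedged sketch of that objective switch; the helper, names, and shapes are illustrative rather than the actual CapPa code. With 75% probability every decoder input token is replaced by a mask token so the caption is predicted in parallel; otherwise training stays autoregressive:

```python
import jax
import jax.numpy as jnp


def maybe_mask_decoder_inputs(rng, decoder_inputs, mask_token_id, full_mask_prob=0.75):
    # With prob `full_mask_prob`, replace every decoder input token by the mask
    # token (parallel prediction); otherwise keep the shifted caption tokens
    # (autoregressive captioning).
    fully_mask = jax.random.bernoulli(rng, p=full_mask_prob)
    return jnp.where(fully_mask, jnp.full_like(decoder_inputs, mask_token_id), decoder_inputs)
```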
Aug 30, 2022
Amazed to see how important it is to correctly select the Distributed Shampoo configuration for training the ViT-VQGAN 🤯

TLDR:
👉 Nesterov momentum brings more stability
👉 Optimal settings are problem specific
I trained a lot of different configurations, following @_arohan_'s suggestion that Nesterov momentum could have an important impact on these types of problems that include a GAN loss.

I tried with and without Nesterov and experimented with a few values of beta1 and beta2.
All the Nesterov runs show much faster convergence and greater stability.
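This is not the Distributed Shampoo implementation itself, just a generic reminder of what the Nesterov flag changes, written as a plain momentum update on a gradient g (which in Shampoo would already be preconditioned); all names are illustrative:

```python
def momentum_update(param, g, m, lr, beta1, nesterov=True):
    """Heavy-ball vs Nesterov momentum on gradient g with momentum buffer m."""
    m = beta1 * m + g
    # Nesterov "looks ahead" along the freshly updated momentum direction
    step = beta1 * m + g if nesterov else m
    return param - lr * step, m
```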
Aug 9, 2022
Work on a better image encoder has started and is based on ViT-VQGAN.

We are experimenting with a few different configurations and are hopeful we'll get to something good 🤞

Current progress report here: wandb.ai/craiyon/vit-vq…
One challenge is that there are tons of parameters to adjust: coefficient factors for the losses (L2, codebook, LPIPS, StyleGAN, discriminator…), optimizer parameters, model architecture, codebook dim…

We added even more options: NormFormer, GLU variants, additional convolutions… 😅
It's tricky to explore the entire space of possibilities, so we run quick experiments and try to make smart decisions.

Eventually we'd like to contribute a great f16 model with a codebook, and an f8 model, not necessarily with a codebook (we'll try a KL loss) but with a low dimension.
Jun 1, 2022
Time to talk about the biggest mistake I made while training DALLE-Mega 😥
The model uses a TPU v3-256 pod generously provided by the Google TRC program.

It's not every day that you get such great resources, with the opportunity to train a 3B-parameter model.
The model had not been scaling well in the past, but I had made a lot of improvements (Shampoo, NormFormer, GLU, better init, etc.).

After a quick learning rate search, training looked very promising for the first few days.
Apr 19, 2022
Finally took the time to read the DALLE 2 paper.

Process:
- text to CLIP image embedding (no need to go through the CLIP text embedding)
- CLIP image embedding to pixels

Here are my notes and how I could apply it to dalle-mini 👇
I'm not entirely sure that you need to go through CLIP image embeddings, but the authors report greater diversity of images when doing so.

Maybe the prior does not use a conditioning scale while the decoder does.
The advantage of using the CLIP image embedding is that it allows interpolation with another image.

I could do the opposite (image to the text embedding of my model), but I'm not sure interpolating text embeddings works as well as interpolating image embeddings.
Feb 3, 2022
The impact of the learning rate still amazes me!
I would never have expected this graph 🤯

A few interesting things to know:
First, you get an immediate drop in loss when lowering the learning rate.

So it can be interesting to end your training with a linear decay to 0 and see if you get something a bit better (see the schedule sketch below).
Then, despite decaying the learning rate, you can see that the slope of progress stays about the same, whereas you might have wanted to wait for a plateau before decaying.

It is extremely hard to guess the right moment to start decaying.
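A minimal optax sketch of the "finish with a linear decay to 0" idea: hold a constant learning rate, then decay linearly to zero over the last stretch of training. The step counts, base LR, and choice of AdamW are illustrative assumptions:

```python
import optax

base_lr, total_steps, decay_steps = 5e-4, 100_000, 10_000

schedule = optax.join_schedules(
    schedules=[
        optax.constant_schedule(base_lr),                  # constant phase
        optax.linear_schedule(init_value=base_lr,          # linear decay to 0
                              end_value=0.0,
                              transition_steps=decay_steps),
    ],
    boundaries=[total_steps - decay_steps],
)
optimizer = optax.adamw(learning_rate=schedule)
```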
