Preliminary settings for the large CapPa model:
- Vision model: 328M params
- Text model: 348M params (67M embeds)
Going to train on TPU v5e-256 from TRC 😎
Model based on "Image Captioners Are Scalable Vision Learners Too" with a few tweaks.
Vision models are most often trained either contrastively on noisy datasets (CLIP) or as classifiers on ImageNet.
Here we train a captioner on a noisy dataset.
The goal is to create a strong vision model (we discard the text model) to be used for any downstream task.
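Below is a minimal, self-contained sketch of the captioning objective (toy stand-in modules, not the actual CapPa architecture): the vision encoder produces image features, and a text decoder is trained with next-token cross-entropy conditioned on them.

```python
# Toy sketch of the captioning objective; module names, sizes and the
# decoder conditioning are illustrative stand-ins, not the CapPa code.
import flax.linen as nn
import optax

VOCAB, DIM = 1000, 64  # toy sizes

class ToyVisionEncoder(nn.Module):
    @nn.compact
    def __call__(self, images):          # images: (B, num_patches, patch_dim)
        return nn.Dense(DIM)(images)     # stand-in for a ViT

class ToyTextDecoder(nn.Module):
    @nn.compact
    def __call__(self, tokens, image_feats):   # tokens: (B, T)
        x = nn.Embed(VOCAB, DIM)(tokens)
        # stand-in for causal self-attention + cross-attention on image_feats
        x = x + image_feats.mean(axis=1, keepdims=True)
        return nn.Dense(VOCAB)(x)               # next-token logits

def caption_loss(params, images, captions, encoder, decoder):
    feats = encoder.apply(params["vision"], images)
    # Teacher forcing: predict token t+1 from tokens up to t and the image.
    logits = decoder.apply(params["text"], captions[:, :-1], feats)
    labels = captions[:, 1:]
    return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()
```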
Aug 30, 2022 • 5 tweets • 2 min read
Amazed to see how important it is to correctly select the Distributed Shampoo configuration when training the ViT-VQGAN 🤯
TLDR:
👉 Nesterov momentum brings more stability
👉 Optimal settings are problem specific
I trained a lot of different configurations following @_arohan_'s suggestion that Nesterov momentum could have a significant impact on these types of problems that include a GAN loss.
I tried with/without Nesterov and experimented with a few values of beta1 and beta2.
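For reference, a sketch of how such a sweep can be set up, assuming the JAX Distributed Shampoo implementation from google-research's scalable_shampoo (the import path and exact argument names depend on how the code is vendored, and the values below are placeholders, not the settings actually used):

```python
# Placeholder sweep over Shampoo settings; not the actual training config.
from scalable_shampoo.distributed_shampoo import distributed_shampoo

def make_optimizer(beta1, beta2, nesterov):
    return distributed_shampoo(
        learning_rate=1e-3,            # placeholder (a schedule in practice)
        block_size=1024,               # preconditioner block size
        beta1=beta1,
        beta2=beta2,
        nesterov=nesterov,
        preconditioning_compute_steps=10,
    )

# Grid in the spirit of the experiments: with/without Nesterov, a few betas.
configs = [
    make_optimizer(b1, b2, use_nesterov)
    for b1 in (0.9, 0.95)
    for b2 in (0.99, 0.999)
    for use_nesterov in (True, False)
]
```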
Aug 9, 2022 • 4 tweets • 1 min read
Work on a better image encoder has started and is based on ViT-VQGAN.
We are experimenting with a few different configurations and are hopeful we'll get to something good 🤞
Current progress report here: wandb.ai/craiyon/vit-vq…
One challenge is that there are tons of parameters to adjust: loss coefficients (L2, codebook, LPIPS, StyleGAN, discriminator…), optimizer parameters, model architecture, codebook dimension…
We added even more options: NormFormer, GLU variants, additional convolutions… 😅
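As an illustration of the tuning problem, here's a hypothetical way of combining the weighted loss terms (the coefficients below are made up; they're exactly what's being tuned):

```python
# Hypothetical loss coefficients for ViT-VQGAN training; the real values
# are part of what is being tuned.
loss_weights = {
    "l2": 1.0,        # pixel reconstruction
    "codebook": 1.0,  # codebook / commitment loss
    "lpips": 0.1,     # perceptual loss
    "stylegan": 0.1,  # adversarial (generator) loss
}

def total_loss(losses, weights=loss_weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in losses.items())
```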
Jun 1, 2022 • 18 tweets • 5 min read
Time to talk about the biggest mistake I made while training DALLE-Mega 😥
The model is trained on a TPU v3-256 pod generously provided by the Google TRC program.
It's not every day that you get access to such great resources and the opportunity to train a 3B-param model.
Apr 21, 2022 • 14 tweets • 5 min read
I've been comparing a lot of transformer variants on large models (400M params): Post/Pre-LN, DeepNet, NormFormer, Swin v2, GLU variants, RMSNorm, Sandwich LN, with GELU, Swish, SmeLU…
More than 2,000h of total training time on TPU v3's 😯
Here are my findings 🤓
You must use a final LayerNorm in the decoder for any pre-LN architecture (one is always present in post-LN).
It also helps convergence to use one at the end of the encoder as well.
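A minimal Flax sketch of what that looks like (toy blocks, causal masking omitted, not one of the exact variants compared above):

```python
# Pre-LN transformer stack ending with a final LayerNorm (toy sketch).
import flax.linen as nn

class PreLNBlock(nn.Module):
    dim: int

    @nn.compact
    def __call__(self, x):
        # Pre-LN: normalize *before* each sub-layer, then add the residual.
        h = nn.LayerNorm()(x)
        h = nn.SelfAttention(num_heads=8)(h)  # causal mask omitted for brevity
        x = x + h
        h = nn.LayerNorm()(x)
        h = nn.Dense(4 * self.dim)(h)
        h = nn.gelu(h)
        h = nn.Dense(self.dim)(h)
        return x + h

class PreLNDecoder(nn.Module):
    dim: int
    depth: int

    @nn.compact
    def __call__(self, x):
        for _ in range(self.depth):
            x = PreLNBlock(self.dim)(x)
        # The final LayerNorm that pre-LN architectures need at the output.
        return nn.LayerNorm()(x)
```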
Apr 19, 2022 • 9 tweets • 2 min read
Finally took the time to read the DALL·E 2 paper.
Process:
- text to CLIP image embedding through a prior (no need to go through CLIP text embeddings)
- CLIP image embedding to pixels through a decoder
Here are my notes and how I could apply it to dalle-mini 👇
I'm not entirely sure that you need to go through CLIP image embeddings, but the authors report greater diversity of images when doing so.
Maybe the prior does not use a conditioning scale while the decoder does.
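The two-stage process from the notes, as placeholder functions (names and signatures are purely illustrative, not from any real codebase):

```python
def generate_image(text, prior, decoder):
    # Stage 1: the prior maps text to a CLIP image embedding
    # (no explicit CLIP text embedding needed).
    clip_image_embedding = prior(text)
    # Stage 2: the decoder turns the CLIP image embedding into pixels.
    return decoder(clip_image_embedding)
```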
Feb 3, 2022 • 6 tweets • 2 min read
Impact of learning rate still amazes me!
I would have never expected this graph 🤯
A few interesting things to know:
First, you get an immediate drop in loss when lowering the learning rate.
So it can be worth ending your training with a linear decay to 0 and seeing if you get something a bit better.
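A sketch of that kind of schedule with optax (values are placeholders): constant learning rate for most of training, then a linear decay to 0 at the end.

```python
import optax

base_lr = 3e-4        # placeholder values
total_steps = 100_000
decay_steps = 10_000  # final portion of training spent decaying to 0

schedule = optax.join_schedules(
    schedules=[
        optax.constant_schedule(base_lr),
        optax.linear_schedule(init_value=base_lr, end_value=0.0,
                              transition_steps=decay_steps),
    ],
    boundaries=[total_steps - decay_steps],
)

optimizer = optax.adamw(learning_rate=schedule)
```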
Dec 7, 2021 • 5 tweets • 2 min read
🥑 are finally starting to be more consistent 🎉
About 13 days into the training!
It's not obvious what is hard and what is easy for the model.
It definitely seems more interested in learning avocado armchairs first 😅