Boris Dayma 🖍️
🖍️ Founder of Craiyon 🥑 Author of dalle-mini
Jun 20, 2024 11 tweets 3 min read
Preliminary settings for the large CapPa model:
- Vision model: 328M params
- Text model: 348M params (67M embeds)

Going to train on TPU v5e-256 from TRC 😎

Model based on "Image Captioners Are Scalable Vision Learners Too" with a few tweaks. Vision models are most often trained either in a contrastive fashion on a noisy dataset (CLIP) or as a classifier on ImageNet.

Here we train a captioner on a noisy dataset.
The goal is to create a strong vision model (we discard the text model) to be used for any downstream task.
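At its core, the captioning objective is just next-token cross-entropy on the caption, with the text decoder conditioned on the vision encoder's image features. Here is a minimal numpy sketch of that loss (shapes and names are my own illustration, not from the actual codebase):

```python
import numpy as np

def caption_cross_entropy(logits, targets):
    """Per-token cross-entropy for an autoregressive captioner.

    logits:  (seq_len, vocab) unnormalized scores from the text decoder,
             which attends to the vision encoder's image features.
    targets: (seq_len,) ground-truth caption token ids.
    """
    # log-softmax over the vocabulary, numerically stabilized
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # negative log-likelihood of the true next token at each position
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

# toy example: a 4-token caption over a 10-word vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
targets = np.array([1, 3, 5, 7])
loss = caption_cross_entropy(logits, targets)
```

Minimizing this loss trains both models jointly; only the vision encoder is kept for downstream tasks.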
Aug 30, 2022 5 tweets 2 min read
Amazed to see how important it is to correctly select the Distributed Shampoo configuration when training the ViT-VQGAN 🤯

TLDR:
👉 Nesterov momentum brings more stability
👉 Optimal settings are problem specific

I trained a lot of different configurations following @_arohan_'s suggestion that Nesterov momentum could have an important impact for these types of problems, which include a GAN loss.

I tried with/without Nesterov and experimented with a few values of beta1 and beta2.
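For reference, the difference between plain and Nesterov momentum is only where the gradient is evaluated. A minimal numpy sketch (Shampoo's preconditioning is omitted entirely; the quadratic objective is just an illustration, not the actual training setup):

```python
import numpy as np

def sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, nesterov=False, steps=100):
    """Plain vs. Nesterov momentum (no preconditioning, for clarity)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point x - lr*beta*v
        g = grad_fn(x - lr * beta * v) if nesterov else grad_fn(x)
        v = beta * v + g
        x = x - lr * v
    return x

# toy convex problem: f(x) = 0.5 * x^T diag(1, 10) x, minimum at 0
grad = lambda x: np.array([1.0, 10.0]) * x
x_plain = sgd_momentum(grad, [1.0, 1.0], nesterov=False)
x_nest = sgd_momentum(grad, [1.0, 1.0], nesterov=True)
```

The look-ahead gradient tends to damp the oscillations momentum causes on ill-conditioned directions, which is plausibly why it helps stability with a GAN loss in the mix.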
Aug 9, 2022 4 tweets 1 min read
Work on a better image encoder has started and is based on ViT-VQGAN.

We are experimenting with a few different configurations and are hopeful we'll get to something good 🤞

Current progress report here: wandb.ai/craiyon/vit-vq…

One challenge is that there are tons of parameters to adjust: coefficient factors for losses (L2, codebook, lpips, stylegan, discriminator…), optimizer parameters, model architecture, codebook dim…

We added even more options: NormFormer, GLU variants, additional convolutions… 😅
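The codebook term above follows the standard VQ-VAE formulation, and the other losses get combined as a weighted sum. A numpy sketch (the weights below are made up for the example, not the values used in the run):

```python
import numpy as np

def stop_grad(x):  # stand-in for jax.lax.stop_gradient
    return x.copy()

def vq_losses(z_e, codes, beta=0.25):
    """Codebook + commitment terms from VQ-VAE.

    z_e:   (n, d) encoder outputs
    codes: (n, d) nearest codebook vectors
    """
    codebook_loss = np.mean((stop_grad(z_e) - codes) ** 2)    # moves codes toward encoder outputs
    commitment_loss = np.mean((z_e - stop_grad(codes)) ** 2)  # keeps encoder outputs near codes
    return codebook_loss + beta * commitment_loss

# illustrative total loss: weighted sum of the terms listed above
# (these coefficient factors are hypothetical)
def total_loss(l2, codebook, lpips, disc, w=(1.0, 1.0, 0.1, 0.1)):
    return w[0] * l2 + w[1] * codebook + w[2] * lpips + w[3] * disc
```

Each weight in `w` is one of the knobs to tune, which is part of why the search space blows up.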
Jun 1, 2022 18 tweets 5 min read
Time to talk about the biggest mistake I made while training DALLE-Mega 😥

The model uses a TPU pod v3-256 generously provided by the Google TRC program.

It's not every day that you get access to such great resources, with the opportunity to train a 3B model.
Apr 21, 2022 14 tweets 5 min read
I've been comparing a lot of transformer variants on large models (400M params): Post/Pre-LN, DeepNet, NormFormers, Swin v2, GLU variants, RMSNorm, Sandwich LN, with GELU, Swish, SmeLU…

More than 2,000h of total training time on TPU v3's 😯

Here are my findings 🤓

You must use a final LayerNorm in the decoder for any pre-LN architecture (they're always present in post-LN).

It also helps convergence to use one at the end of the encoder as well.
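A minimal numpy sketch of why the final LayerNorm matters in pre-LN: each block normalizes only the input to its sublayer, so the residual stream itself is never normalized and its scale grows with depth (the block structure here is simplified to a single linear sublayer):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, w):
    # pre-LN: normalize the sublayer input, then add the residual;
    # the residual stream itself stays unnormalized
    return x + layer_norm(x) @ w

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
for _ in range(24):  # a stack of pre-LN blocks
    x = pre_ln_block(x, rng.normal(size=(d, d)) / np.sqrt(d))
unnormalized_scale = np.abs(x).mean()  # grows with depth
x = layer_norm(x)                      # final LayerNorm restores unit scale
```

Without that last normalization, the decoder's output projection sees activations whose magnitude depends on depth.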
Apr 19, 2022 9 tweets 2 min read
Finally took the time to read DALLE 2 paper.

Process:
- text to CLIP image embedding (no need to go through the CLIP text embedding)
- CLIP image embedding to pixels

Here are my notes and how I could apply it to dalle-mini 👇

I'm not entirely sure that you need to go through CLIP image embeddings, but the authors report a greater diversity of images by doing it.

Maybe the prior does not use a conditioning scale while the decoder does.
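For context, the "conditioning scale" here is classifier-free guidance, which blends the conditional and unconditional predictions. A minimal sketch of the formula (the inputs are placeholders, not actual model outputs):

```python
import numpy as np

def guided_prediction(cond, uncond, scale):
    """Classifier-free guidance: scale > 1 pushes the output toward the
    conditional prediction and away from the unconditional one."""
    return uncond + scale * (cond - uncond)

cond = np.array([1.0, 2.0])     # placeholder conditional prediction
uncond = np.array([0.0, 0.0])   # placeholder unconditional prediction
guided = guided_prediction(cond, uncond, 3.0)
```

With `scale = 1` you recover the plain conditional prediction; larger scales trade diversity for fidelity to the prompt.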
Feb 3, 2022 6 tweets 2 min read
Impact of learning rate still amazes me!
I would have never expected this graph 🤯

A few interesting things to know:

First, you get an immediate drop in loss when lowering the learning rate.

So it can be interesting to end your training with a linear decay to 0 and see if you get something a bit better.
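The decay phase described above is just a linear ramp to zero. A sketch (the base learning rate and step count are arbitrary example values):

```python
def linear_decay_to_zero(base_lr, step, total_steps):
    """Linearly anneal the learning rate from base_lr to 0 over total_steps."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# e.g. finish a run with a 10k-step decay phase from lr = 3e-4
lrs = [linear_decay_to_zero(3e-4, s, 10_000) for s in (0, 5_000, 10_000)]
# -> [0.0003, 0.00015, 0.0]
```

In a JAX/optax setup the equivalent would be a linear schedule appended after the main schedule.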
Dec 7, 2021 5 tweets 2 min read
🥑 are finally starting to be more consistent 🎉
About 13 days into the training!

It's not obvious what is hard and what is easy for the model.

It definitely seems more interested in learning avocado armchairs first 😅
Jul 30, 2021 6 tweets 5 min read
DALL·E mini is now available 🥳🥑

Generate images from any text prompt! huggingface.co/spaces/flax-co…

🔍 Find out how it works in our report: wandb.ai/dalle-mini/dal…