Preliminary settings for the large CapPa model:
- Vision model: 328M params
- Text model: 348M params (67M embeds)
Going to train on TPU v5e-256 from TRC 😎
Model based on "Image Captioners Are Scalable Vision Learners Too" with a few tweaks.
Vision models are most often trained either contrastively on noisy datasets (CLIP) or as classifiers on ImageNet.
Here we train a captioner on a noisy dataset.
The goal is to create a strong vision model (we discard the text model) to be used for any downstream task.
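Below is a minimal, self-contained sketch of the captioning objective (toy stand-in modules, not the actual CapPa architecture): the vision encoder produces image features, and a text decoder is trained with next-token cross-entropy conditioned on them.

```python
# Toy sketch of the captioning objective; module names, sizes and the
# decoder conditioning are illustrative stand-ins, not the CapPa code.
import flax.linen as nn
import optax

VOCAB, DIM = 1000, 64  # toy sizes

class ToyVisionEncoder(nn.Module):
    @nn.compact
    def __call__(self, images):          # images: (B, num_patches, patch_dim)
        return nn.Dense(DIM)(images)     # stand-in for a ViT

class ToyTextDecoder(nn.Module):
    @nn.compact
    def __call__(self, tokens, image_feats):   # tokens: (B, T)
        x = nn.Embed(VOCAB, DIM)(tokens)
        # stand-in for causal self-attention + cross-attention on image_feats
        x = x + image_feats.mean(axis=1, keepdims=True)
        return nn.Dense(VOCAB)(x)               # next-token logits

def caption_loss(params, images, captions, encoder, decoder):
    feats = encoder.apply(params["vision"], images)
    # Teacher forcing: predict token t+1 from tokens up to t and the image.
    logits = decoder.apply(params["text"], captions[:, :-1], feats)
    labels = captions[:, 1:]
    return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()
```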
Aug 30, 2022 • 5 tweets • 2 min read
Amazed to see how important it is to correctly select the Distributed Shampoo configuration when training the ViT-VQGAN 🤯
TLDR:
👉 Nesterov momentum brings more stability
👉 Optimal settings are problem specific
I trained a lot of different configurations following @_arohan_'s suggestion that Nesterov momentum could have a significant impact on these types of problems that include a GAN loss.
I tried with/without Nesterov and experimented with a few values of beta1 and beta2.
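For reference, a sketch of how such a sweep can be set up, assuming the JAX Distributed Shampoo implementation from google-research's scalable_shampoo (the import path and exact argument names depend on how the code is vendored, and the values below are placeholders, not the settings actually used):

```python
# Placeholder sweep over Shampoo settings; not the actual training config.
from scalable_shampoo.distributed_shampoo import distributed_shampoo

def make_optimizer(beta1, beta2, nesterov):
    return distributed_shampoo(
        learning_rate=1e-3,            # placeholder (a schedule in practice)
        block_size=1024,               # preconditioner block size
        beta1=beta1,
        beta2=beta2,
        nesterov=nesterov,
        preconditioning_compute_steps=10,
    )

# Grid in the spirit of the experiments: with/without Nesterov, a few betas.
configs = [
    make_optimizer(b1, b2, use_nesterov)
    for b1 in (0.9, 0.95)
    for b2 in (0.99, 0.999)
    for use_nesterov in (True, False)
]
```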
Aug 9, 2022 • 4 tweets • 1 min read
Work on a better image encoder has started and is based on ViT-VQGAN.
We are experimenting with a few different configurations and are hopeful we'll get to something good 🤞
Current progress report here: wandb.ai/craiyon/vit-vq…
One challenge is that there are tons of parameters to adjust: loss coefficients (L2, codebook, LPIPS, StyleGAN, discriminator…), optimizer parameters, model architecture, codebook dimension…
We added even more options: NormFormer, GLU variants, additional convolutions… 😅
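As an illustration of the tuning problem, here's a hypothetical way of combining the weighted loss terms (the coefficients below are made up; they're exactly what's being tuned):

```python
# Hypothetical loss coefficients for ViT-VQGAN training; the real values
# are part of what is being tuned.
loss_weights = {
    "l2": 1.0,        # pixel reconstruction
    "codebook": 1.0,  # codebook / commitment loss
    "lpips": 0.1,     # perceptual loss
    "stylegan": 0.1,  # adversarial (generator) loss
}

def total_loss(losses, weights=loss_weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in losses.items())
```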
Jun 1, 2022 • 18 tweets • 5 min read
Time to talk about the biggest mistake I made while training DALLE-Mega 😥
The model is trained on a TPU v3-256 pod generously provided by the Google TRC program.
It's not every day that you get access to such great resources and the opportunity to train a 3B-param model.
Apr 21, 2022 • 14 tweets • 5 min read
I've been comparing a lot of transformer variants on large models (400M params): Post/Pre-LN, DeepNet, NormFormer, Swin v2, GLU variants, RMSNorm, Sandwich LN, with GELU, Swish, SmeLU…
More than 2,000h of total training time on TPU v3's 😯
Here are my findings 🤓
You must use a final LayerNorm in the decoder for any pre-LN architecture (one is always present in post-LN).
It also helps convergence to use one at the end of the encoder as well.
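A minimal Flax sketch of what that looks like (toy blocks, causal masking omitted, not one of the exact variants compared above):

```python
# Pre-LN transformer stack ending with a final LayerNorm (toy sketch).
import flax.linen as nn

class PreLNBlock(nn.Module):
    dim: int

    @nn.compact
    def __call__(self, x):
        # Pre-LN: normalize *before* each sub-layer, then add the residual.
        h = nn.LayerNorm()(x)
        h = nn.SelfAttention(num_heads=8)(h)  # causal mask omitted for brevity
        x = x + h
        h = nn.LayerNorm()(x)
        h = nn.Dense(4 * self.dim)(h)
        h = nn.gelu(h)
        h = nn.Dense(self.dim)(h)
        return x + h

class PreLNDecoder(nn.Module):
    dim: int
    depth: int

    @nn.compact
    def __call__(self, x):
        for _ in range(self.depth):
            x = PreLNBlock(self.dim)(x)
        # The final LayerNorm that pre-LN architectures need at the output.
        return nn.LayerNorm()(x)
```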
Apr 19, 2022 • 9 tweets • 2 min read
Finally took the time to read the DALL·E 2 paper.
Process:
- text to CLIP image embedding through a prior (no need to go through CLIP text embeddings)
- CLIP image embedding to pixels through a decoder
Here are my notes and how I could apply it to dalle-mini 👇
I'm not entirely sure that you need to go through CLIP image embeddings, but the authors report greater diversity of images when doing so.
Maybe the prior does not use a conditioning scale while the decoder does.
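The two-stage process from the notes, as placeholder functions (names and signatures are purely illustrative, not from any real codebase):

```python
def generate_image(text, prior, decoder):
    # Stage 1: the prior maps text to a CLIP image embedding
    # (no explicit CLIP text embedding needed).
    clip_image_embedding = prior(text)
    # Stage 2: the decoder turns the CLIP image embedding into pixels.
    return decoder(clip_image_embedding)
```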
Feb 3, 2022 • 6 tweets • 2 min read
Impact of learning rate still amazes me!
I would have never expected this graph 🤯
A few interesting things to know:
First, you get an immediate drop in loss when lowering the learning rate.
So it can be worth ending your training with a linear decay to 0 and seeing if you get something a bit better.
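A sketch of that kind of schedule with optax (values are placeholders): constant learning rate for most of training, then a linear decay to 0 at the end.

```python
import optax

base_lr = 3e-4        # placeholder values
total_steps = 100_000
decay_steps = 10_000  # final portion of training spent decaying to 0

schedule = optax.join_schedules(
    schedules=[
        optax.constant_schedule(base_lr),
        optax.linear_schedule(init_value=base_lr, end_value=0.0,
                              transition_steps=decay_steps),
    ],
    boundaries=[total_steps - decay_steps],
)

optimizer = optax.adamw(learning_rate=schedule)
```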
Dec 7, 2021 • 5 tweets • 2 min read
🥑 are finally starting to be more consistent 🎉
About 13 days into the training!
It's not obvious what is hard and what is easy for the model.
It definitely seems more interested in learning avocado armchairs first 😅