A quick thread on "How DALL-E 2, Imagen and Parti Architectures Differ", with a breakdown into comparable modules, annotated with sizes 🧵 #dalle2 #imagen #parti
* figures taken from the corresponding papers, with slight modifications
* parts used only for training are greyed out
By now we know that
- DALL-E & Imagen = diffusion; Parti = autoregressive
- Imagen & Parti use generic text encoders; DALLE uses CLIP enc
But in fact, one version of Imagen also used CLIP, one version of DALL-E also had AR prior. So there are more connections than it seemed.
If we break each architecture down into *modules*, the similarity/comparability becomes even clearer.
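To make the module view concrete, here's a minimal structural sketch of the pipeline all three share. (Hypothetical stub functions, not code from any of the papers; the real modules are huge neural networks.)

```python
import numpy as np

# Hypothetical stubs standing in for each model's real (billion-parameter) modules.
def text_encoder(prompt: str) -> np.ndarray:
    """Module 1: text -> embedding (CLIP / T5-XXL / generic transformer)."""
    return np.zeros(768)  # placeholder embedding

def text_to_image(text_emb: np.ndarray) -> np.ndarray:
    """Module 2: embedding -> 64x64 image (prior+decoder / diffusion / AR tokens)."""
    return np.zeros((64, 64, 3))

def super_res(image: np.ndarray, factor: int = 4) -> np.ndarray:
    """Module 3: one stage of the cascaded upsampling."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def generate(prompt: str) -> np.ndarray:
    x = text_to_image(text_encoder(prompt))  # 64x64
    x = super_res(x)                         # 256x256
    return super_res(x)                      # 1024x1024
```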
First of all, they all have a "text encoder", but differ in types and sizes (sketch after the list):
- DALL-E uses the CLIP text encoder
- Imagen uses T5-XXL (a frozen, generic LM encoder)
- Parti uses a generic transformer encoder
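A rough sketch of this module, using public Hugging Face checkpoints as stand-ins (assumption: these approximate, but are not, the exact encoders used; "t5-small" subs in for the actual T5-XXL just to keep it lightweight):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

prompt = "a photo of a corgi playing a trumpet"

# DALL-E 2 route: CLIP's text encoder (this public checkpoint stands in for
# OpenAI's internal one).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
with torch.no_grad():
    clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state

# Imagen route: a frozen generic T5 encoder. The paper uses T5-XXL; "t5-small"
# here only keeps the sketch runnable on a laptop.
t5_tok = T5Tokenizer.from_pretrained("t5-small")
t5_enc = T5EncoderModel.from_pretrained("t5-small")
with torch.no_grad():
    t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state

print(clip_emb.shape, t5_emb.shape)  # (1, seq_len, hidden_dim) each
```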
Then, they all turn text embeddings into an image, each via a different intermediate representation (contrast sketched after the list):
- DALL-E uses "prior + decoder" components, where the prior can be either AR or diffusion
- Imagen uses pure diffusion
- Parti uses a transformer decoder that predicts discrete image tokens + a ViT-VQGAN image tokenizer (pretrained, then finetuned)
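The key contrast in this middle module: iterative denoising vs. token-by-token decoding. A toy sketch with placeholder networks (the two loops, not the stubs, are the point):

```python
import torch

# Placeholder networks; the real ones are large conditioned U-Nets / transformers.
def denoise_step(x, t, text_emb):
    return 0.99 * x  # hypothetical single denoising step

def next_token_logits(tokens, text_emb):
    return torch.randn(8192)  # hypothetical decoder over an 8192-entry codebook

text_emb = torch.zeros(1, 77, 768)

# Diffusion route (Imagen conditions on text directly; DALL-E 2 first runs a
# prior that maps the CLIP text embedding to a CLIP *image* embedding, then a
# diffusion decoder):
x = torch.randn(1, 3, 64, 64)          # start from pure noise
for t in reversed(range(1000)):        # iteratively denoise into an image
    x = denoise_step(x, t, text_emb)

# Autoregressive route (Parti): predict discrete image tokens left-to-right,
# then the ViT-VQGAN detokenizer maps the token grid back to pixels.
tokens = []
for _ in range(32 * 32):               # e.g. a 32x32 grid of image tokens
    tokens.append(int(torch.argmax(next_token_logits(tokens, text_emb))))
```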
Lastly, they all have a "super-resolution" block that upsamples generated images from 64x64 to 256x256, all the way to 1024x1024 (toy sketch after the list).
- DALL-E uses diffusion, 700M from 64 -> 256, 300M from 256 -> 1024
- Imagen uses diffusion, 600M + 300M
- Parti uses convolution, 15M + 30M (LOL)
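Parti's tiny numbers make sense once you note that a convolutional upsampler is a single feed-forward pass, while a diffusion upsampler is a U-Net run for many denoising steps. A toy PixelShuffle upsampler (assumption: Parti's actual SR module isn't public; this only illustrates the scale difference):

```python
import torch
import torch.nn as nn

class ConvUpsampler(nn.Module):
    """Toy convolutional 4x upsampler: one forward pass, no sampling loop."""
    def __init__(self, channels: int = 64, factor: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 3 * factor**2, 3, padding=1),
            nn.PixelShuffle(factor),  # folds channels into a 4x spatial upscale
        )

    def forward(self, x):
        return self.net(x)

up = ConvUpsampler()
print(sum(p.numel() for p in up.parameters()))  # ~30K params in this toy version
out = up(torch.randn(1, 3, 256, 256))           # -> shape (1, 3, 1024, 1024)
```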
End of thread. All of them are amazing! 😆
This is a deeper dive following an earlier tweet of mine that circulated widely but lacked clarity.
(Some numbers are estimates from descriptions in the papers. Please send in corrections if you think they are off.)