You may have seen surreal and absurd AI-generated images like these...
These are all generated with an AI tool known as DALL·E mini
Let's talk about the history of #dallemini, and also *how* it works! ↓↓↓🧵
First, let's clear up the different AI tools that people often get confused about:
- DALL·E was an @OpenAI-developed AI project from Jan 2021
- DALL·E mini is a community-created project inspired by DALL·E
- DALL·E 2 is another @OpenAI-developed tool released in April (2/16)
DALL·E mini was originally developed about a year ago, back in July 2021.
During a programming competition organized by @huggingface (an AI company), @borisdayma & some community folks (including myself!) developed a neural network inspired by DALL·E & studied it (3/16)
DALL·E mini learns from *millions* of image-text caption pairs sourced from the Internet. (5/16)
The first component is a neural language model. You may already be familiar with neural language models, like the famous GPT-3 model, which takes in text and produces more text.
DALL·E mini uses another type of neural language model known as "BART" (6/16)
It's worth noting that language models don't actually work with text directly; they represent the text as a sequence of discrete values that map to pieces of text (this is known as "tokenization"). (7/16)
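To make "tokenization" concrete, here's a tiny sketch with the Hugging Face transformers library (illustrative only, not necessarily DALL·E mini's exact tokenizer): a prompt becomes a list of discrete IDs, and those IDs can be mapped back to text.

```python
# Minimal tokenization sketch (illustrative; not DALL·E mini's exact setup).
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

prompt = "an astronaut riding a horse"
ids = tokenizer(prompt).input_ids                        # text -> list of integer token IDs
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))              # the text pieces each ID stands for
print(tokenizer.decode(ids, skip_special_tokens=True))   # and back to text
```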
In fact, it's worth pointing out that BART is technically a "sequence-to-sequence" neural network for this reason. It can take in any discrete sequence and output a corresponding discrete sequence, depending on the task it is trained on. (8/16)
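Here's the seq-to-seq idea with a generic BART checkpoint (a summarization model, not DALL·E mini): a discrete sequence goes in, a different discrete sequence comes out, and what the output *means* depends entirely on what the model was trained to do.

```python
# Sequence-to-sequence sketch with a generic BART checkpoint (illustrative only).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer("The tower is 324 metres tall, about the same height as an 81-storey building.",
                   return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_length=30)   # discrete seq in -> discrete seq out
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```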
While we could treat each pixel as a separate discrete value, this is inefficient & doesn't scale well.
Instead, we use another neural network to *learn* a mapping from an image to a discrete sequence. (9/16)
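Here's a toy numpy sketch of that mapping idea (random stand-in numbers, not a real model): an encoder turns the image into a grid of latent vectors, and each vector is replaced by the index of its nearest entry in a learned codebook, giving a discrete sequence.

```python
import numpy as np

# Toy vector-quantization sketch (random stand-ins, not a real trained model).
rng = np.random.default_rng(0)

codebook = rng.normal(size=(1024, 64))     # 1024 learned codes, 64-dim each
latents = rng.normal(size=(16 * 16, 64))   # encoder output: a 16x16 grid of latent vectors

# Each latent vector is replaced by the index of its nearest codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=1)             # the discrete sequence: 256 integers in [0, 1024)
print(indices.shape, indices[:8])
```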
This neural network is known as VQGAN, which you may recognize from the VQGAN+CLIP technique used by another viral AI art tool. (10/16)
This VQGAN model is trained on millions of images to learn a good mapping. A good mapping is one that can go from the discrete sequence back to the full image with minimal error. (11/16)
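In a sketch, "minimal error" means the round trip image → indices → image changes the image as little as possible. (Hedged: real VQGANs also use perceptual & adversarial losses, and the encoder/decoder below are hypothetical placeholders.)

```python
import numpy as np

# Hedged sketch of the "minimal error" idea; encoder, decoder & codebook are placeholders.
def reconstruction_error(image, encoder, decoder, codebook):
    latents = encoder(image)                                      # image -> grid of latent vectors
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)                                # latents -> discrete sequence
    reconstruction = decoder(codebook[indices])                   # discrete sequence -> image
    return ((image - reconstruction) ** 2).mean()                 # training drives this down
```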
This is mainly because the VQGAN hasn't learned a mapping that can easily represent faces as a sequence of discrete values. (12/16)
So to summarize: we use BART, a sequence-to-sequence neural network, to map our text prompt (which is represented as a discrete sequence) to another discrete sequence, which is then mapped to an actual image with the VQGAN. (13/16)
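In rough pseudocode (the function names here are hypothetical placeholders, not the actual dalle-mini API):

```python
# Rough shape of the generation pipeline (hypothetical names, not the real API).
def generate_image(prompt, tokenizer, bart, vqgan_decoder):
    text_tokens = tokenizer(prompt).input_ids    # text -> discrete sequence
    image_tokens = bart.generate(text_tokens)    # seq2seq: text tokens -> image tokens
    return vqgan_decoder(image_tokens)           # discrete sequence -> pixels
```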
Millions of images and their corresponding captions were available as datasets for training DALL·E mini. During training, the BART model is given a caption and is adjusted to reduce the difference between the image it generates and the actual corresponding image. (14/16)
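One common way to set this up, and roughly how I understand DALL·E mini's training (treat the details here as an assumption): compare the model's predicted distribution over image tokens against the real image's VQGAN tokens with a cross-entropy loss.

```python
import torch.nn.functional as F

# Hedged sketch: the model predicts a distribution over image tokens at each position,
# and training minimizes cross-entropy against the real image's VQGAN token sequence.
def training_loss(predicted_logits, real_image_tokens):
    # predicted_logits: (batch, seq_len, codebook_size); real_image_tokens: (batch, seq_len)
    return F.cross_entropy(predicted_logits.reshape(-1, predicted_logits.size(-1)),
                           real_image_tokens.reshape(-1))
```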
That's an oversimplification, obviously; there are many challenges in scaling these huge models up and training on millions of images, but the basic concept is simple. (15/16)
This is a diffusion model pipeline that goes beyond what AlphaFold2 did: predicting the structures of protein-molecule complexes containing DNA, RNA, ions, etc.
Google announces Med-Gemini, a family of Gemini models fine-tuned for medical tasks! 🔬
Achieves SOTA on 10 of the 14 benchmarks, spanning text, multimodal & long-context applications.
Surpasses GPT-4 on all benchmarks!
This paper is super exciting, let's dive in ↓
The team developed a variety of model variants. First, let's talk about the ones they built for language tasks.
The finetuning dataset is quite similar to Med-PaLM 2's, except with one major difference:
self-training with search
(2/14)
The goal is to improve clinical reasoning and the model's ability to use search results.
Synthetic chains-of-thought (CoT), w/ and w/o search results in context, are generated; those leading to incorrect predictions are filtered out; the model is trained on the remaining CoT; and then the synthetic CoT are regenerated with the improved model
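A rough sketch of that loop (the helper names are hypothetical; this is just the shape of the procedure as described):

```python
# Sketch of the self-training-with-search loop (hypothetical helper names).
def self_train_with_search(model, questions, answers, rounds=2):
    for _ in range(rounds):
        examples = []
        for q, a in zip(questions, answers):
            results = web_search(q)                        # retrieve search results for the question
            for ctx in (results, None):                    # CoT generated with and without search context
                cot, pred = model.generate_cot(q, context=ctx)
                if pred == a:                              # filter out CoT that led to wrong answers
                    examples.append((q, ctx, cot, a))
        model = finetune(model, examples)                  # train on the surviving CoT, then regenerate
    return model
```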
Before I continue, I want to mention that this work was led by @RiversHaveWings, @StefanABaumann, and @Birchlabs. @DanielZKaplan and @EnricoShippole were also valuable contributors. (2/11)
High-resolution image synthesis w/ diffusion is difficult without using multi-stage models (e.g. latent diffusion). It's even more difficult for diffusion transformers due to O(n^2) attention scaling. So we want an easily scalable transformer arch for high-res image synthesis. (3/11)
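Quick back-of-the-envelope on why O(n^2) hurts at high resolution (the patch size here is just an illustrative assumption):

```python
# Why quadratic attention hurts at high resolution (patch size is an illustrative assumption).
patch = 4
for res in (256, 512, 1024):
    tokens = (res // patch) ** 2            # number of image tokens
    attn_entries = tokens ** 2              # entries in each self-attention matrix
    print(f"{res}x{res}: {tokens:,} tokens -> {attn_entries:,} attention entries")
# Doubling the resolution gives 4x the tokens and 16x the attention cost.
```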