You may have seen surreal and absurd AI-generated images like these...
These are all generated with an AI tool known as DALL·E mini
Let's talk about the history of #dallemini, and also *how* it works! ↓↓↓🧵
First, let's clarify the different AI tools that many people confuse:
- DALL·E was an @OpenAI-developed AI project from Jan 2021
- DALL·E mini is a community-created project inspired by DALL·E
- DALL·E 2 is another @OpenAI-developed tool, released in April 2022 (2/16)
DALL·E mini was originally developed about a year ago, back in July 2021.
During a programming competition organized by @huggingface (an AI company), @borisdayma & some community folks (including myself!) developed a neural network inspired by DALL·E & studied it (3/16)
DALL·E mini learns from *millions* of image-caption pairs sourced from the Internet. (5/16)
The first component is a neural language model. You may already be familiar with neural language models, like the famous GPT-3 model, which takes in text and produces more text.
DALL·E mini uses another type of neural language model known as "BART" (6/16)
Note that language models don't actually work with text directly; instead, they represent the text as a sequence of discrete values that map back to text (this is known as "tokenization"). (7/16)
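For example, here's roughly what tokenization looks like with the Hugging Face transformers library (the checkpoint and prompt are illustrative choices, not necessarily DALL·E mini's exact setup):

```python
# A minimal tokenization sketch using Hugging Face `transformers`.
# The checkpoint and prompt are illustrative, not DALL·E mini's exact setup.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
token_ids = tokenizer.encode("an armchair shaped like an avocado")
print(token_ids)                    # a list of integer token ids
print(tokenizer.decode(token_ids))  # maps the ids back to text
```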
In fact, BART is technically a "sequence-to-sequence" neural network for this reason: it can take in any discrete sequence and output a corresponding discrete sequence, depending on the task it is trained on. (8/16)
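As a concrete (if image-unrelated) example of sequence-to-sequence behavior, here's a BART checkpoint trained for summarization mapping one token sequence to another (the checkpoint choice is just an illustration):

```python
# Sketch: BART as a sequence-to-sequence model (a summarization checkpoint here).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer("Some long input text to condense...", return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_length=30)  # discrete sequence out
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```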
While we could treat each pixel as a separate discrete value, this is inefficient & doesn't scale well.
Instead, we use another neural network to *learn* a mapping from an image to a sequence. (9/16)
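To get a feel for the scale difference, here's a back-of-envelope comparison (the image size and grid size are assumed, typical values):

```python
# Back-of-envelope: why per-pixel sequences don't scale (numbers are assumptions).
height, width, channels = 256, 256, 3
pixel_sequence_length = height * width * channels  # 196,608 discrete values per image

# A VQGAN-style encoder instead compresses the image to a small grid of
# codebook indices, e.g. a 16x16 grid (a typical downsampling, assumed here):
vqgan_sequence_length = 16 * 16                    # 256 discrete values per image
print(pixel_sequence_length, vqgan_sequence_length)
```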
This neural network is known as VQGAN, which you may recognize from the VQGAN+CLIP technique used by another viral AI art tool (10/16)
This VQGAN model is trained on millions of images to learn a good mapping. A good mapping is one that can go from the sequence back to a full image with minimal error. (11/16)
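Conceptually, "minimal error" means the decoder can reconstruct the original image from the discrete sequence. A hypothetical sketch of that round-trip objective (`encode` and `decode` are stand-in functions, not a real VQGAN API):

```python
import numpy as np

def reconstruction_error(image, encode, decode):
    """Hypothetical sketch: how well a VQGAN-style mapping round-trips an image.

    `encode` maps an image to a sequence of discrete codebook indices;
    `decode` maps that sequence back to an image. Both are stand-ins here.
    """
    sequence = encode(image)                         # image -> discrete sequence
    reconstruction = decode(sequence)                # discrete sequence -> image
    return np.mean((image - reconstruction) ** 2)    # pixel-wise squared error
```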
This is also why faces in DALL·E mini generations often come out distorted: the VQGAN hasn't learned a good mapping that can easily represent faces as a sequence of discrete values. (12/16)
So to summarize: we use BART, a sequence-to-sequence neural network, to map our text prompt (which is represented as a discrete sequence) to another discrete sequence, which is then mapped to an actual image with the VQGAN. (13/16)
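Put as pseudocode, the whole pipeline looks roughly like this (`bart` and `vqgan_decode` are hypothetical stand-ins, not DALL·E mini's real API):

```python
# End-to-end sketch of the generation pipeline described above.
# `bart` and `vqgan_decode` are hypothetical stand-ins, not the real API.
def generate_image(prompt, tokenizer, bart, vqgan_decode):
    text_sequence = tokenizer.encode(prompt)        # text -> discrete sequence
    image_sequence = bart.generate(text_sequence)   # discrete seq -> image token seq
    return vqgan_decode(image_sequence)             # image token seq -> pixels
```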
Millions of images and corresponding captions were available as datasets for DALL·E mini to learn from. During the learning process, the BART model is given a caption and is adjusted to reduce the difference between the generated images and the actual corresponding images. (14/16)
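A hypothetical training-step sketch (names and framework are assumptions; the real DALL·E mini training code is written in JAX/Flax and is far more involved):

```python
# Hypothetical training-step sketch; `bart` is a stand-in model whose output
# logits score each possible image token at each position.
import torch.nn.functional as F

def training_step(bart, caption_ids, target_image_token_ids):
    # BART predicts the image-token sequence from the caption...
    logits = bart(input_ids=caption_ids).logits
    # ...and is adjusted to reduce the difference between its predicted
    # sequence and the sequence of the actual corresponding image.
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_image_token_ids.view(-1),
    )
    loss.backward()  # gradients are then used by an optimizer to update parameters
    return loss
```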
Well, that's an oversimplification obviously, with many challenges when scaling up these huge models and training on millions of images, but the basic concept is simple. (15/16)
Awesome and surprising things you can do with Jupyter Notebooks ⬇
1. Write a full-fledged Python library!
You can write all of your code, documentation, & tests with Jupyter Notebooks & nbdev.fast.ai, all while maintaining best software practices and implementing CI/CD! (see the sketch after this list)
The fastai deep learning library is written entirely in notebooks!
2. Create a blog!
Platforms like fastpages.fast.ai make it easy to create blog posts from your Jupyter Notebooks, with the code cells and outputs included in your post, and posts can even be made interactive.
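To make point 1 concrete, here's roughly what notebook cells look like with nbdev (directives shown are from nbdev v2; the function itself is just a toy example):

```python
#| default_exp core
# ^ in the notebook's first code cell: tells nbdev which module this notebook exports to

#| export
def add(a, b):
    "Add two numbers together."
    return a + b

# A plain cell (no directive) serves as both documentation and a test:
assert add(2, 3) == 5
```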
The model is based on an autoregressive transformer (like DALL·E) combined with a VQGAN, but it utilizes several key tricks to improve both the quality and the controllability of the generations. 2/10
One trick is the use of a segmentation map (referred to as a scene) and a VQGAN for the scene.
As you can see here, this provides more controllability to the generation process. 3/10
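Roughly, the extra controllability comes from conditioning the transformer on scene tokens alongside the text tokens. A simplified, hypothetical sketch (none of these names are the paper's real API):

```python
# Hypothetical sketch of scene-conditioned generation (not the paper's real code).
def generate_with_scene(text_tokens, segmentation_map,
                        scene_vqgan, transformer, image_vqgan):
    scene_tokens = scene_vqgan.encode(segmentation_map)  # scene -> discrete sequence
    conditioning = text_tokens + scene_tokens            # condition on both sequences
    image_tokens = transformer.generate(conditioning)    # autoregressive decoding
    return image_vqgan.decode(image_tokens)              # tokens -> pixels
```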
The first step is to explore the data, also known as exploratory data analysis (EDA). Getting a feel for the data is essential for deriving insights that can help you. 👨‍💻
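A few typical first EDA steps with pandas (the file name is a placeholder, not the competition's actual data):

```python
# Common first-look EDA steps; "train.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("train.csv")
print(df.head())      # peek at a few rows
print(df.info())      # column types and missing values
print(df.describe())  # summary statistics for numeric columns
```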
After exploring the data, the next step is to make a baseline solution. In this case, I put together a quick pretrained baseline based on @Nils_Reimers's SentenceTransformers:
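For reference, a minimal SentenceTransformers baseline looks something like this (the checkpoint and the similarity setup are assumptions, not the exact competition code):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a common pretrained checkpoint
embeddings = model.encode(["first text", "second text"], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1])  # cosine similarity score
print(similarity)
```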
What matters most when training a neural network is how well it generalizes to unseen data.
For neural networks, it turns out there's a simple principle that can help you understand model generalization. (1/18)
A thread ↓
First let's formalize what generalization means.
We can say that the generalization gap is the difference between the loss on unseen data drawn from the same distribution and the loss on the training data. (2/18)
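Written out (with $\mathcal{D}$ the data distribution, $f_\theta$ the model, and $\ell$ the per-example loss; the notation here is my own, not from the thread):

$$\text{gap}(\theta) \;=\; \underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f_\theta(x), y)\big]}_{\text{loss on unseen data}} \;-\; \underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell(f_\theta(x_i), y_i)}_{\text{loss on the } n \text{ training examples}}$$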
The loss itself depends on the parameters of the model, and we adjust the parameters to decrease the loss through gradient descent and reach a (local) minimum. (3/18)
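For concreteness, here's a minimal gradient-descent sketch on a toy one-parameter loss (the loss function and learning rate are illustrative assumptions):

```python
# Minimal gradient descent on the toy loss L(theta) = (theta - 3)**2.
def gradient(theta):
    return 2 * (theta - 3)  # dL/dtheta

theta, learning_rate = 0.0, 0.1
for _ in range(100):
    theta -= learning_rate * gradient(theta)  # step against the gradient
print(theta)  # converges toward the minimum at theta = 3
```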
In order to select the 3 winners, I needed to keep track of the entrants. I used the Twitter API to do so. I signed up for the Twitter API and followed the required steps: developer.twitter.com/en/portal/peti…
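Fetching the entrants and picking winners can look roughly like this with the tweepy library (assuming entrants are the retweeters of the giveaway tweet; the tweet id and token are placeholders):

```python
import random
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credentials
response = client.get_retweeters(id=1234567890)           # placeholder tweet id
entrants = [user.username for user in response.data]
winners = random.sample(entrants, 3)                      # pick 3 distinct winners
print(winners)
```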