Tanishq Mathew Abraham, Ph.D.
Jun 13, 2022 · 16 tweets
You may have seen surreal and absurd AI-generated images like these...

These are all generated with an AI tool known as DALL·E mini

Let's talk about the history of #dallemini, and also *how* it works! ↓↓↓🧵
First, let's clarify the different AI tools that often get confused:

- DALL·E was an @OpenAI-developed AI project from Jan 2021

- DALL·E mini is a community-created project inspired by DALL·E

- DALL·E 2 is another @OpenAI-developed tool, released in April 2022 (2/16)
DALL·E mini was originally developed about a year ago, back in July 2021.

During a programming competition organized by @huggingface (an AI company), @borisdayma & some community folks (including myself!) developed a neural network inspired by DALL·E & studied it (3/16)
It was a great experience, and we even won that competition!

Boris has continued developing DALL·E mini since then, building larger neural networks trained on even more data!

But how does it work?? (4/16)
At the core of DALL·E mini are two components:
- a language model
- an image decoder

DALL·E mini learns from *millions* of image-caption pairs sourced from the Internet. (5/16)
The first component is a neural language model. You may already be familiar with neural language models, like the famous GPT-3, which takes in text and produces more text.

DALL·E mini uses another type of neural language model known as "BART". (6/16)
But the BART model takes in text and produces images! How is that possible?

It's worth realizing that language models don't actually work with text directly: they represent the text as a sequence of discrete values (integer tokens) that map back to pieces of text. This is known as "tokenization". (7/16)
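
As a concrete illustration, here's a minimal tokenization sketch using Hugging Face's off-the-shelf BART tokenizer (illustrative only, not DALL·E mini's actual code; DALL·E mini ships its own tokenizer configuration):

    # Minimal tokenization sketch with Hugging Face's BART tokenizer.
    # (Illustrative -- DALL·E mini uses its own tokenizer config.)
    from transformers import BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

    prompt = "an astronaut riding a horse"
    ids = tokenizer(prompt)["input_ids"]

    print(ids)                                   # a list of integer token ids
    print(tokenizer.convert_ids_to_tokens(ids))  # the subword pieces they map to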
In fact, BART is what's known as a "sequence-to-sequence" neural network for this reason: it can take in any discrete sequence and output a corresponding discrete sequence, depending on the task it is trained on. (8/16)
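
To make "sequence-to-sequence" concrete, here's a hedged sketch of BART on a plain text-to-text task (summarization, using the public facebook/bart-large-cnn checkpoint as an assumed example); in DALL·E mini the output sequence is image tokens rather than text:

    # BART as a generic sequence-to-sequence model: one discrete sequence
    # in, another out. The task here is summarization; DALL·E mini instead
    # trains BART to output *image* tokens.
    from transformers import BartForConditionalGeneration, BartTokenizer

    name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(name)
    model = BartForConditionalGeneration.from_pretrained(name)

    text = ("DALL·E mini is a community-created AI model that generates "
            "images from text prompts, inspired by OpenAI's DALL·E.")
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))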
So what if we also represent images as a sequence of discrete values? 🤔

While we could treat each pixel as a separate discrete value, this is inefficient & doesn't scale well.

Instead, we use another neural network to *learn* a mapping from an image to a sequence. (9/16)
This neural network is known as VQGAN, which you may recognize from the VQGAN+CLIP technique used by another viral AI art tool. (10/16)
The VQGAN model trains on millions of images to learn a good mapping, where a good mapping is one that can go from the discrete sequence back to the full image with minimal error. (11/16)
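
To give a feel for the idea, here's a toy sketch (my own illustration, with made-up sizes) of the quantization step at the heart of VQGAN: each image patch's feature vector is replaced by the index of its nearest entry in a learned codebook, turning the image into a short sequence of integers:

    # Toy sketch of VQGAN-style quantization. A real VQGAN learns the
    # encoder and the codebook jointly from millions of images.
    import torch

    codebook = torch.randn(1024, 256)      # 1024 learned code vectors, 256-dim
    features = torch.randn(16 * 16, 256)   # encoder output: one vector per patch

    distances = torch.cdist(features, codebook)  # (256 patches, 1024 codes)
    indices = distances.argmin(dim=1)            # the image as 256 integers

    print(indices.shape)   # torch.Size([256])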
As a separate note, you might have noticed that many of the #dallemini artworks have messed-up faces 😄

This is mainly because the VQGAN hasn't learned a good mapping to easily represent faces as a sequence of discrete values. (12/16)
So to summarize: we use BART, a sequence-to-sequence neural network, to map our text prompt (represented as a discrete sequence) to another discrete sequence, which is then mapped to an actual image by the VQGAN. (13/16)
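
In code-shaped form, the data flow looks roughly like this (bart_generate and vqgan_decode are hypothetical toy stand-ins so the sketch runs; the real components are large trained networks):

    # End-to-end data flow with hypothetical stand-ins for the trained models.
    import torch

    def bart_generate(text_token_ids):          # text tokens -> image tokens
        return torch.randint(0, 1024, (256,))   # toy placeholder output

    def vqgan_decode(image_token_ids):          # image tokens -> pixels
        return torch.rand(3, 256, 256)          # toy placeholder image

    text_tokens = torch.tensor([0, 42, 7, 2])   # toy ids for a tokenized prompt
    image_tokens = bart_generate(text_tokens)   # step 1: BART, sequence -> sequence
    image = vqgan_decode(image_tokens)          # step 2: VQGAN, tokens -> image
    print(image.shape)                          # torch.Size([3, 256, 256])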
DALL·E mini learns from datasets of millions of images and their corresponding captions. During training, the BART model is given a caption and is adjusted to reduce the difference between the image it generates and the actual corresponding image. (14/16)
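
Here's a minimal sketch of one such training step, under the assumption that "reduce the difference" means a cross-entropy loss between the model's predicted image tokens and the VQGAN tokens of the real paired image (toy stand-in model for illustration):

    # Toy training step: push predicted image tokens toward the VQGAN
    # tokens of the real image paired with the caption.
    import torch
    import torch.nn.functional as F

    vocab_size, seq_len, dim = 1024, 256, 64
    model = torch.nn.Linear(dim, vocab_size)    # stand-in for the BART decoder
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    caption_features = torch.randn(seq_len, dim)              # encoded caption (toy)
    target_tokens = torch.randint(0, vocab_size, (seq_len,))  # real image's VQGAN tokens (toy)

    logits = model(caption_features)               # one distribution per position
    loss = F.cross_entropy(logits, target_tokens)  # penalize wrong image tokens
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()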
It's that simple!

Well, that's an oversimplification obviously; there are many challenges in scaling up these huge models and training on millions of images, but the basic concept is simple. (15/16)
Hope this thread was educational!

If you like this thread, please share!

Consider following me (@iScienceLuvr) for AI/ML-related content! 🙂

Also consider following the main DALL·E mini developer, @borisdayma! (16/16, end of thread)
