You may have seen surreal and absurd AI-generated images like these...
These are all generated with an AI tool known as DALL·E mini
Let's talk about the history of #dallemini, and also *how* it works! ↓↓↓🧵
First, let's clarify the different AI tools that people often get confused about:
- DALL·E was an @OpenAI-developed AI project from Jan 2021
- DALL·E mini is a community-created project inspired by DALL·E
- DALL·E 2 is another @OpenAI-developed tool released in April (2/16)
DALL·E mini was actually first developed about a year ago, back in July 2021.
During a programming competition organized by @huggingface (an AI company), @borisdayma & some community folks (including myself!) developed a neural network inspired by DALL·E & studied it (3/16)
DALL·E mini learns from *millions* of image-text caption pairs sourced from the Internet. (5/16)
The first component is a neural language model. You may already be familiar with neural language models, like the famous GPT-3 model, which takes text and produces more text.
DALL·E mini uses another type of neural language model known as "BART" (6/16)
It's worth realizing that language models don't actually work with text directly; they represent the text as a sequence of discrete values that map to text (this is known as "tokenization"). (7/16)
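Here's a minimal sketch of tokenization using the Hugging Face transformers library (the checkpoint choice and the printed IDs are just illustrative):

```python
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")

# Text becomes a sequence of discrete integer IDs...
ids = tok("a cat riding a bicycle")["input_ids"]
print(ids)              # e.g. [0, 102, ..., 2] -- exact values depend on the vocabulary

# ...and the sequence of IDs maps back to text
print(tok.decode(ids))  # "<s>a cat riding a bicycle</s>"
```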
In fact, it's worth pointing out that BART is technically what is known as a "sequence-to-sequence" neural network for this reason. It can take in any discrete sequence and output a corresponding discrete sequence depending on the task it is trained on. (8/16)
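For instance, here's the same idea with a BART checkpoint trained on summarization (a text-to-text task); DALL·E mini instead trains BART to output *image* tokens. A hedged sketch, again using the transformers library:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# One discrete sequence in...
inputs = tok("DALL-E mini generates images from short text prompts.", return_tensors="pt")
# ...another discrete sequence out (here a summary, since that's this checkpoint's training task)
out_ids = model.generate(**inputs, max_length=20)
print(tok.decode(out_ids[0], skip_special_tokens=True))
```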
While we could consider each pixel as a separate discrete value, this is inefficient & doesn't scale well.
Instead we utilize another neural network to *learn* a mapping from an image to a sequence. (9/16)
This neural network is known as VQGAN, which you may recognize from the VQGAN+CLIP technique used by another viral AI art tool (10/16)
This VQGAN model is trained on millions of images to learn a good mapping. A good mapping is one that can go from the sequence to a full image with minimal error. (11/16)
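To make the idea concrete, here's a toy illustration of the vector-quantization step at the heart of VQGAN (all shapes and values below are made up; the real model learns the codebook and uses a convolutional encoder/decoder):

```python
import torch

codebook = torch.randn(1024, 256)     # 1024 discrete codes, each a 256-dim vector (toy values)
patches = torch.randn(16 * 16, 256)   # embeddings for a 16x16 grid of image patches

# Quantize: each patch is replaced by the index of its nearest codebook vector
codes = torch.cdist(patches, codebook).argmin(dim=1)  # (256,) -- the image as a discrete sequence

# The decoder reconstructs the image from these codes;
# a good codebook keeps the quantization error small
quantized = codebook[codes]
quantization_error = torch.mean((quantized - patches) ** 2)
```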
You may have noticed that faces often come out distorted. This is mainly because the VQGAN hasn't learned a good mapping to easily represent faces as a sequence of discrete values. (12/16)
So to summarize: we use BART, a sequence-to-sequence neural network, to map our text prompt (represented as a discrete sequence) to another discrete sequence, which is then mapped to an actual image with the VQGAN. (13/16)
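Putting the pieces together, the whole generation pipeline looks roughly like this (every name below is a stand-in for the trained models, not a real API):

```python
def generate_image(prompt, tokenizer, bart, vqgan_decoder):
    text_tokens = tokenizer(prompt)             # text -> discrete sequence
    image_tokens = bart.generate(text_tokens)   # seq-to-seq: text tokens -> image tokens
    code_grid = image_tokens.reshape(16, 16)    # e.g. 256 tokens arranged as a 16x16 grid
    return vqgan_decoder(code_grid)             # discrete codes -> pixels
```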
Millions of images and corresponding captions were available as datasets for DALL·E mini to learn from. During training, the BART model is given a caption and is adjusted to reduce the difference between the generated image and the actual corresponding image. (14/16)
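One training step might look like the sketch below. This is my own hedged reconstruction: in practice the "difference" is measured as a cross-entropy loss on the discrete image tokens rather than on raw pixels, and every argument is a stand-in.

```python
import torch.nn.functional as F

def training_step(caption, image, tokenizer, bart, vqgan_encoder, optimizer):
    target_tokens = vqgan_encoder(image)   # the real image's discrete sequence (batch, seq)
    logits = bart(tokenizer(caption))      # predicted token distributions (batch, seq, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), target_tokens)
    loss.backward()                        # adjust BART to reduce the difference
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```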
Well, that's an oversimplification obviously; there are many challenges when scaling up these huge models and training on millions of images, but the basic concept is simple. (15/16)
A new startup, Inception Labs, has released Mercury Coder, "the first commercial-scale diffusion large language model"
It's 5-10x faster than current gen LLMs, providing high-quality responses at low costs.
And you can try it now!
The performance is similar to small frontier models while achieving a throughput of ~1000 tokens/sec... on H100s! Reaching this level of throughput for autoregressive LLMs typically requires specialized chips.
It's currently tied for second place on Copilot Arena!
Cleo was an account on Math Stack Exchange that was infamous for dropping the answer to the most difficult integrals with no explanation...
often mere minutes after the question was asked!!
For years, no one knew who Cleo was, UNTIL NOW!
People noticed that the same few accounts kept interacting with Cleo (asking the questions Cleo answered, commenting, etc.), and a couple of them were only active at the same times as Cleo, too.
People began to wonder whether someone was controlling all these accounts as alts.
One of the accounts, Laila Podlesny, had an email address associated with it. By attempting a fake login on that Gmail account and obtaining the backup recovery email, someone figured out that Vladimir Reshetnikov was in control of Laila Podlesny.
Based on other interactions from Vladimir on Math.SE, it seemed likely he controlled Cleo, Laila, and a couple of other accounts as well.
This is a diffusion model pipeline that goes beyond what AlphaFold2 did: predicting the structures of protein-molecule complexes containing DNA, RNA, ions, etc.
Google announces Med-Gemini, a family of Gemini models fine-tuned for medical tasks! 🔬
Achieves SOTA on 10 of the 14 benchmarks, spanning text, multimodal & long-context applications.
Surpasses GPT-4 on all benchmarks!
This paper is super exciting, let's dive in ↓
The team developed a variety of model variants. First, let's talk about the ones built for language tasks.
The finetuning dataset is quite similar to Med-PaLM2, except with one major difference:
self-training with search
(2/14)
The goal is to improve clinical reasoning and ability to use search results.
Synthetic chains-of-thought, with and without search results in context, are generated; incorrect predictions are filtered out; the model is trained on the surviving CoTs; and then the synthetic CoTs are regenerated with the improved model.
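In pseudocode, the loop might look something like this (a hedged sketch; `web_search`, `generate_cot`, and `finetune` are illustrative stand-ins, not the paper's actual code):

```python
def self_training_round(model, dataset):
    kept = []
    for question, answer in dataset:
        # Generate chain-of-thought w/ and w/o search results in context
        for context in (web_search(question), None):
            cot, prediction = model.generate_cot(question, context=context)
            if prediction == answer:          # filter out incorrect predictions
                kept.append((question, context, cot))
    model.finetune(kept)                      # train on the surviving CoTs...
    return model                              # ...then regenerate CoTs with the improved model
```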
Before I continue, I want to mention this work was led by @RiversHaveWings, @StefanABaumann, @Birchlabs. @DanielZKaplan, @EnricoShippole were also valuable contributors. (2/11)
High-resolution image synthesis w/ diffusion is difficult without using multi-stage models (ex: latent diffusion). It's even more difficult for diffusion transformers due to O(n^2) scaling. So we want an easily scalable transformer arch for high-res image synthesis. (3/11)
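A quick back-of-the-envelope calculation shows why that O(n^2) term bites at high resolution (a patch size of 8 is assumed purely for illustration):

```python
for res in (64, 128, 256, 512):
    n = (res // 8) ** 2     # number of image tokens at this resolution
    print(f"{res}x{res}: {n} tokens -> {n * n:,} attention entries")

# Pixels grow with res^2, but attention entries grow with res^4:
# going from 64x64 to 512x512 multiplies pixels by 64 but attention cost by 4096.
```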