I always thought #StableDiffusion prompts needed the right combination of words. But byte-pair encoding can represent anything you can type, including math formulas and emojis. Turns out you don't need any words at all! Here's how and why this works...🧵

Prompt: e=mc^2 [image]
Prompts are fed to Stable Diffusion as raw bytes, with each letter/symbol represented as one or more bytes. A "tokenizer" then looks for commonly occurring spans of adjacent bytes and groups them into a single known "word". Stable Diffusion only knows 49,408 words.

Here's "🧛🦇🗡️" [images]
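You can poke at this step yourself. Here's a minimal sketch (my addition, not from the thread) using the Hugging Face CLIPTokenizer that Stable Diffusion v1 uses; the checkpoint name and the exact token splits are assumptions:

```python
# Minimal sketch: inspect how SD's CLIP tokenizer splits a prompt.
# Assumes the Hugging Face "transformers" package and the
# "openai/clip-vit-large-patch14" checkpoint (the text encoder used by SD v1).
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tok.vocab_size)               # 49408 known "words"

# A formula and an emoji prompt both tokenize fine -- no English required.
print(tok.tokenize("e=mc^2"))       # a handful of BPE tokens
print(tok.tokenize("🧛🦇🗡️"))       # common emojis come out as single tokens
```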
You might think 49,408 is a lot. Well, it's not. Here are the first 1,200 words in the vocabulary [image]. They don't get you very far. The words are auto-selected by a simple algorithm and half are junk. And what are the weird "�" entries? We'll get back to them later...
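If you loaded the tokenizer above, you can reproduce that view of the vocabulary yourself (the exact slices below are just illustrative):

```python
# Peek at the vocabulary itself. The single-character entries at the low ids
# are the byte-level fallback "words" -- the entries the thread shows as "�".
vocab = tok.get_vocab()             # dict: token string -> id
by_id = sorted(vocab, key=vocab.get)
print(len(by_id))                   # 49408
print(by_id[:20])                   # byte-level entries
print(by_id[600:620])               # a slice of the merged "real" words
```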
Common English words, symbols, and *most* emojis are known to SD as a single "word."
Next, each "word" is replaced with a 768-dimensional "embedding vector". The result is a list of at most 77 such embedding vectors.

Prompt: 🧑‍🚀👽🛰️🌌 🔥🍄 [image]
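Here's a sketch of that step (my own, with the usual HF checkpoint assumed): the tokenizer pads/truncates to 77 tokens, and the CLIP text encoder turns them into the embedding sequence that conditions the image model.

```python
# Sketch: prompt -> 77 x 768 embedding sequence (SD v1's conditioning input).
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

batch = tok("🧑‍🚀👽🛰️🌌 🔥🍄", padding="max_length", max_length=77,
            truncation=True, return_tensors="pt")
with torch.no_grad():
    emb = enc(**batch).last_hidden_state
print(emb.shape)                    # torch.Size([1, 77, 768])
```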
These vectors go into a large neural network that makes the image. All emojis are represented by fairly similar embedding vectors. In fact, most emojis lie closer to unrelated emojis than to any English word. Still, the model understands the unique meaning of each emoji. [image]
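Here's one way to check that kind of claim (a rough sketch continuing from the snippet above; it assumes each of these strings maps to a single token, which isn't guaranteed):

```python
# Compare raw token embeddings with cosine similarity: emoji vs. emoji
# vs. an English word.
import torch.nn.functional as F

emb_table = enc.text_model.embeddings.token_embedding.weight   # [49408, 768]

def token_vec(s):
    ids = tok(s, add_special_tokens=False)["input_ids"]
    return emb_table[ids[0]].detach()           # assumes s is a single token

print(F.cosine_similarity(token_vec("🦇"), token_vec("🧛"), dim=0))       # emoji vs. emoji
print(F.cosine_similarity(token_vec("🦇"), token_vec("vampire"), dim=0))  # emoji vs. word
```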
Unfortunately, there are a LOT of possible Unicode characters and words - too many to have a separate embedding vector for each. Remember those "�" things? When unusual stuff comes along, it's broken into individual bytes and represented using these emergency "words".
But there are only 256 different bytes, so we can have a separate embedding vector for each byte. Stable Diffusion is trained on 2 billion captions, so it learns to recognize many byte sequences even if they aren't in the vocabulary of "common" words that get their own vector.
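In code (same tokenizer as above; which emojis fall back to bytes depends on the vocab file, so treat these picks as examples):

```python
# A common emoji is (likely) one token; a rare one falls back to byte tokens.
ids_common = tok("🦇", add_special_tokens=False)["input_ids"]
ids_rare   = tok("🖍", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids_common))   # e.g. one token
print(tok.convert_ids_to_tokens(ids_rare))     # several byte-level tokens
```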
Let's look at some of the rejects. Unlike most emojis, 🏯 and 📜 are not commonly used enough to be part of the 49K-word vocabulary. The closest conventional word to 🏯 in embedding space is "yin" (as in "yin and yang"). The closest word to 📜 is "news".

Here's "🏯🔥🐉 📜" [images]
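Here's a sketch of how you might reproduce that kind of nearest-word lookup (my own reconstruction; the thread doesn't give the exact method). For a symbol that spans several tokens, it simply averages the token embeddings.

```python
# Nearest vocabulary words to a query string, by cosine similarity in the
# (non-contextual) token embedding space. Reuses tok and emb_table from above.
import torch
import torch.nn.functional as F

vocab = tok.get_vocab()
words = {t: i for t, i in vocab.items() if t.endswith("</w>") and t[:-4].isalpha()}
word_vecs = emb_table[list(words.values())].detach()       # [num_words, 768]

def nearest_words(query, k=5):
    ids = tok(query, add_special_tokens=False)["input_ids"]
    q = emb_table[ids].detach().mean(dim=0, keepdim=True)  # average if multi-token
    sims = F.cosine_similarity(q, word_vecs, dim=1)
    return [list(words)[int(i)] for i in sims.topk(k).indices]

print(nearest_words("🏯"))
print(nearest_words("📜"))
```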
Emojis that represent writing implements are not widely used. 🖍 and 🖊 have to stay as raw bytes. But the neural net recognizes their byte sequences and associates them with artistic styles. In fact, you can control the style of an image by placing one in your prompt. [images]
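If you want to try it, here's a sketch with an assumed model id and seed (not the author's exact setup):

```python
# Nudge the style by appending a writing-implement emoji to the prompt.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i, style in enumerate(["", "🖍", "🖊"]):
    g = torch.Generator("cuda").manual_seed(0)   # same seed for a fair comparison
    image = pipe(f"a castle on a hill {style}".strip(), generator=g).images[0]
    image.save(f"castle_{i}.png")
```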
Text tokenization is a topic that is often dismissed as tedious and boring, but I think it's weirdly fascinating. Maybe that says more about me than about tokenization, though. Hopefully some of you out there in Twitterland agree. Thanks for reading!
