I always thought #StableDiffusion prompts needed the right combination of words. But byte-pair encoding can represent anything you can type, including math formulas and emojis. Turns out you don't need any words at all! Here's how and why this works...🧵
Prompt: e=mc^2
Prompts are fed to Stable Diffusion as binary code, with each letter/symbol represented as one or more bytes (plain ASCII characters take one byte, emojis take several). Then a "tokenizer" looks for commonly occurring spans of adjacent bytes and groups them into a single known "word". Stable Diffusion only knows 49408 words.
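To see the bytes behind a prompt, here is a minimal sketch that assumes the prompt is UTF-8 encoded. The helper name `utf8_byte_counts` is made up for illustration and is not part of any Stable Diffusion code.

```python
# Count the UTF-8 bytes behind each character of a prompt.
# Assumption: the prompt is encoded as UTF-8 before tokenization.
def utf8_byte_counts(prompt: str) -> list[tuple[str, int]]:
    return [(ch, len(ch.encode("utf-8"))) for ch in prompt]

formula = utf8_byte_counts("e=mc^2")
emoji = utf8_byte_counts("🧛🦇")

print(formula)  # every ASCII character here takes 1 byte
print(emoji)    # each of these two emojis takes 4 bytes
```

The tokenizer's job is then to group runs of these bytes into known "words".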
Here's "🧛🦇🗡️"
You might think 49408 is a lot. Well, it's not. Here's the first 1200 words in the vocabulary. They don't get you very far. The words are auto-selected by a simple algorithm and half are junk. And what are the weird "�" entries? We'll get back to them later...
Common English words, symbols, and *most* emojis are known to SD as a single "word."
Next, each "word" is replaced with a 768-dimensional "embedding vector" (the width of the CLIP text encoder in Stable Diffusion v1). The result is a list of at most 77 such embedding vectors.
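The lookup step can be sketched like this. It is a toy under stated assumptions: the vectors are random rather than learned, `DIM` is a tiny stand-in width, and simply cutting off extra tokens is a simplification of whatever the real pipeline does at the 77-token limit. `vector_for` and `embed` are hypothetical names.

```python
import random

VOCAB_SIZE = 49408  # vocabulary size quoted in the thread
MAX_TOKENS = 77     # sequence limit quoted in the thread
DIM = 8             # toy width (assumption); the real vectors are far wider

def vector_for(token_id: int) -> list[float]:
    """Stand-in for one row of a learned embedding table (random, not trained)."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError("token id outside the vocabulary")
    rng = random.Random(token_id)  # the same id always yields the same vector
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one vector per token, keeping at most MAX_TOKENS of them."""
    return [vector_for(t) for t in token_ids[:MAX_TOKENS]]

vectors = embed(list(range(100)))    # 100 token ids in, only 77 vectors out
print(len(vectors), len(vectors[0]))
```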
Prompt: 🧑🚀👽🛰️🌌 🔥🍄
These vectors go into a large neural network that makes the image. All emojis are represented by fairly similar embedding vectors. In fact, most emojis lie closer to unrelated emojis than to any English word. Still, the model understands the unique meaning of each emoji.
Unfortunately, there's a LOT of possible Unicode characters and words - too many to have a separate embedding vector for each. Remember those "�" things? When unusual stuff comes along it's broken into individual bytes and represented using these emergency "words".
But there's only 256 different bytes, so we can have a separate embedding vector for each byte. Stable Diffusion is trained on 2 billion captions, so it learns to recognize many byte sequences even if they aren't in the vocabulary of "common" words that get their own vector.
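Here is a toy sketch of that byte fallback. It is not the real tokenizer: it uses greedy longest-match instead of learned byte-pair merges, the `KNOWN` vocabulary is made up, and the `<0x..>` labels are just an illustrative way to print the per-byte "words".

```python
# Hypothetical mini-vocabulary, NOT the real 49408-entry one.
KNOWN = {"fire", "dragon", " ", "🔥", "🐉"}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Greedy longest match against the known vocabulary.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in KNOWN:
                match = text[i:j]
                break
        if match is not None:
            tokens.append(match)
            i += len(match)
        else:
            # Fallback: one emergency token per UTF-8 byte of the unknown character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

print(tokenize("fire dragon"))  # known words stay whole
print(tokenize("🏯🔥"))          # 🏯 is unknown here, so it becomes 4 byte tokens
```

Since a byte can only take 256 values, 256 fallback entries are enough to cover any input.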
Let's look at some of the rejects. Unlike most emojis, 🏯 and 📜 are not commonly used enough to be part of the 49K-word vocabulary. The closest conventional word to 🏯 in embedding space is "yin" (as in "yin and yang"). The closest word to 📜 is "news".
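"Closest word in embedding space" is a nearest-neighbor search. A minimal sketch, assuming cosine similarity as the distance measure (the thread doesn't say which one was used) and using made-up 3-dimensional vectors chosen only to illustrate the mechanics:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query: list[float], table: dict[str, list[float]]) -> str:
    """Return the word whose vector has the highest cosine similarity to query."""
    return max(table, key=lambda word: cosine(query, table[word]))

# Made-up toy vectors (assumption), not real model embeddings.
words = {
    "yin": [0.9, 0.1, 0.2],
    "news": [0.1, 0.9, 0.3],
    "cat": [0.2, 0.2, 0.9],
}
castle = [0.8, 0.2, 0.1]  # hypothetical stand-in for the 🏯 embedding

print(nearest(castle, words))
```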
Here's "🏯🔥🐉 📜"
Emojis that represent writing implements are not widely used. 🖍 and 🖊 have to stay as raw bytes. But the neural net recognizes their byte sequences and associates them with artistic styles. In fact, you can control the style of an image by placing one in your prompt.
Text tokenization is a topic that is often dismissed as tedious and boring, but I think it's weirdly fascinating. Maybe that says more about me than about tokenization, though. Hopefully some of you out there in Twitterland agree. Thanks for reading!
• • •
The Llama2 model is pretty impressive. Human evaluators rank it slightly *better* than ChatGPT on a range of things (excluding code and reasoning).
Here's a short TL;DR on what Meta did to improve the state of the art 🧵
Llama1: Small models (7B & 13B) were trained on 1 trillion tokens. Large models saw 1.4T tokens.
Llama2: All models trained on 2T tokens. This means the small models are "over trained" beyond what the scaling laws recommend, resulting in great performance for small models!
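A rough way to quantify "over trained": the sketch below assumes the common rule of thumb of about 20 training tokens per parameter from compute-optimal scaling work. That number is an assumption here, not something stated in the thread.

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb, not from the thread

def compute_optimal_tokens(params: float) -> float:
    """Token budget suggested by the assumed tokens-per-parameter rule."""
    return params * TOKENS_PER_PARAM

params_7b = 7e9        # the 7B model
trained_tokens = 2e12  # 2T tokens, as stated in the thread

optimal = compute_optimal_tokens(params_7b)
ratio = trained_tokens / optimal

print(f"rule-of-thumb budget: {optimal:.2e} tokens")
print(f"Llama2-7B saw about {ratio:.1f}x that many")
```

Under that assumption the 7B model sees roughly 14 times the "recommended" data.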
As a result of the long training runs, Llama2 beats other major open-source models at most academic benchmarks. Their 7B model is WAY better than other 7B options on all tasks except code.
Nvidia’s AI products follow a weird reverse Moore’s law: every two years, you get half as many FLOPS for your money. This is the opposite of the rest of the chip market 📈
With the H100 release, Nvidia had to reverse course.
A 🧵 on Nvidia losing its grip on the GPU market.
Let’s focus in on the machine learning GPUs. You can see the value drop over time, until the H100 created an uptick. Note: I’m using today’s price for each card, but a similar downward trend also holds for the release prices.
The drop is because of monopoly power and clever market segmentation.
Example: The “server-grade” V100 is a minor variant of the 2080ti gaming card. Nvidia sells it to institutions instead of gamers, charging 5X more for the V100. This means huge profits. lambdalabs.com/blog/best-gpu-…
Training an LLM takes about 1 trillion words. That’s about 30,000 years of typing.
But where does this data come from?
And what does this have to do with the Reddit protests?
Here’s how OpenAI trains models on “the entire internet.” 🧵📜
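As a sanity check on the 30,000-year figure, this sketch works out the typing speed it implies, assuming someone types non-stop, around the clock (my assumption; the thread doesn't give the typing speed it used).

```python
TOTAL_WORDS = 1e12  # about 1 trillion words, from the thread
YEARS = 30_000      # from the thread

# Assumption: typing continuously, 24 hours a day, 365.25 days a year.
minutes = YEARS * 365.25 * 24 * 60
words_per_minute = TOTAL_WORDS / minutes

print(f"implied speed: {words_per_minute:.0f} words per minute")
```

That works out to a bit over 60 words per minute, which is a brisk but realistic typing speed.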
Much of what we know about OpenAI is from urban legends. But the GPT3 paper does have a table showing their data sources. The cliché that LLMs are trained on “the whole internet” comes from the use of CommonCrawl.
CommonCrawl (CC) is a non-profit that scrapes the internet with bots and tries to record everything since 2008. 90% of CC is HTML, CSS, and scripts. The usable 10% contains junk that needs to be tossed out to clean the dataset.
A common criticism of LLM watermarks is they can be removed by AI paraphrasing or human editing. Let's put this theory to the test! Can a watermark be automatically removed by GPT? Can a grad student do any better? The results surprised me 🧵 arxiv.org/pdf/2306.04634…
First, if you don’t remember how watermarks work, you might revisit my original post on this issue.
TL;DR The watermark is a subtle pattern embedded in LLM outputs that labels them as machine-generated. High accuracy detection usually requires 50-ish words.
The experiment: We generated watermarked text using the Llama model, then asked a non-watermarked LLM (GPT-3.5) to re-write it. We did lots of prompt engineering to try to get rid of the watermark. Finally, we checked whether we could detect the watermark in the rewritten text.
LLMs do many things more efficiently than humans. But there’s one thing humans still do WAY better than machines: learn. In this thread I compare the learning efficiency of machines to that of humans, and I use scaling laws to convert humans into equivalent LLMs. 🧵
A typical human hears 20K words per day. By age five, a typical child should have heard 37 million words. A 50-year-old should have heard 370M words. greatschools.org/gk/articles/wo…
Let’s compare that to an LLM. Meta’s Llama model is proficient in English and elementary math. Llama was trained on 1.4 trillion tokens. That’s 3,800 times more tokens than a human has verbally exchanged at age 50.
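The arithmetic behind these numbers, as a quick sketch. It assumes 365 days per year and treats words and tokens as interchangeable, as the thread does.

```python
WORDS_PER_DAY = 20_000  # from the thread
LLAMA_TOKENS = 1.4e12   # from the thread

def words_heard(age_years: int) -> int:
    """Words heard by a given age, assuming 365 days per year."""
    return WORDS_PER_DAY * 365 * age_years

at_5 = words_heard(5)
at_50 = words_heard(50)
ratio = LLAMA_TOKENS / at_50

print(f"age 5:  {at_5:,} words")
print(f"age 50: {at_50:,} words")
print(f"Llama saw about {ratio:,.0f}x more")
```

The unrounded results (36.5M, 365M, and a ratio a little above 3,800) match the rounded figures in the thread.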