Training an LLM takes about 1 trillion words. That’s about 30,000 years of typing.
But where does this data come from?
And what does this have to do with the Reddit protests?
Here’s how OpenAI trains models on “the entire internet.” 🧵📜
Much of what we know about OpenAI comes from urban legends. But the GPT-3 paper does have a table showing their data sources. The cliché that LLMs are trained on “the whole internet” comes from the use of CommonCrawl.
CommonCrawl (CC) is a non-profit that has been scraping the internet with bots since 2008, trying to record everything. 90% of CC is HTML, CSS, and scripts. Even the usable 10% is full of junk that has to be tossed out to clean the dataset.
CC is too big to clean by hand, but there are algorithmic hacks for removing garbage. Google’s C4 dataset is an auto-cleaned CC. Here are the top sources. It contains a lot of straight news, but also some Breitbart, RT, and a tiny bit of Daily Stormer. washingtonpost.com/technology/int…
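What do those “algorithmic hacks” look like? Here’s a minimal sketch of a few of the published C4 heuristics (the real pipeline also deduplicates text and filters against a bad-word list):

```python
TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_page(text: str):
    """Apply simplified C4-style heuristics to one scraped page."""
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # C4 keeps only lines that end in terminal punctuation...
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # ...and drops short boilerplate lines (menus, buttons, etc.)
        if len(line.split()) < 5:
            continue
        # Lines mentioning javascript are usually browser warnings
        if "javascript" in line.lower():
            continue
        kept_lines.append(line)
    page = "\n".join(kept_lines)
    # Pages with curly braces are likely leftover code, not prose
    if "{" in page:
        return None
    # Require a minimum amount of surviving text
    if len(kept_lines) < 3:
        return None
    return page
```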
It's way better to clean data with humans than with machines, but that costs a fortune. What if I told you people will rate your data samples for free? Now that's truly the stuff of legends. This is where Reddit comes in.
OpenAI’s WebText dataset was created by snatching hyperlinks from Reddit posts, but only those that users gave 3+ upvotes. It is mostly high-quality news articles and blog posts, but OpenAI never distributed WebText because it contains copyrighted articles.
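A minimal sketch of this kind of collection, assuming you have a local dump of Reddit submissions (the field names follow Reddit’s JSON schema; the file path is a placeholder):

```python
import json

def collect_outbound_links(submissions_path: str, min_score: int = 3):
    links = set()
    with open(submissions_path) as f:
        for line in f:
            post = json.loads(line)
            # Skip self-posts: we want links pointing off-site
            if post.get("is_self"):
                continue
            # Reddit's "score" is upvotes minus downvotes
            if post.get("score", 0) >= min_score:
                links.add(post["url"])
    return links  # each URL then gets scraped and its text extracted
```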
OpenAI has used Reddit before. For example, TL;DR summaries from Reddit were used to train older models. OpenAI’s frequent use of Reddit has been cited as one reason Reddit recently started charging for data access, sparking protests from Reddit users. nytimes.com/2023/04/18/tec…
Finally, many models use Wikipedia, ArXiv, etc. These release periodic data dumps to disincentivize scraping. Some are already in CC, but they are often included again and up-weighted because of their quality. As huge as they may seem, each is <1% of a cleaned CC.
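“Up-weighting” happens at batch-sampling time. A sketch using roughly the mixture weights reported in the GPT-3 paper (the paper’s rounded percentages don’t sum to exactly 1; random.choices normalizes them anyway):

```python
import random

# Wikipedia is <1% of the corpus by size but ~3% of training batches:
# it gets seen multiple times while most of CC is seen less than once.
MIXTURE = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source() -> str:
    """Pick which dataset the next training document comes from."""
    sources, weights = zip(*MIXTURE.items())
    return random.choices(sources, weights=weights, k=1)[0]
```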
OpenAI’s most mysterious assets are “books.” Legends are all we have for these. Is Books2 from the (illegal?) LibGen server, which explains OpenAI’s silence on the issue? Is Books1 from Smashwords/Project Gutenberg? Maybe they’re written by gnomes. Who knows 🤷
And that’s it. The whole internet. Well, at least the publicly accessible, text-based internet. So where did OpenAI go from here?
There are sources of data that are not in CC, either because they are paywalled or because they are not text.
One legend is that OpenAI generated transcripts of top YouTube videos when they constructed the Whisper dataset. YouTube has millions of hours of high-quality video, containing billions of tokens. openai.com/research/whisp…
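If the legend is true, the mechanics are simple with OpenAI’s open-source whisper package (the audio file name below is a placeholder for audio ripped from a downloaded video):

```python
import whisper

model = whisper.load_model("base")            # larger checkpoints exist
result = model.transcribe("video_audio.mp3")  # placeholder file
print(result["text"])                         # transcript becomes training text
```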
The most recent OpenAI legend is that they hired an army of writers and code developers to make datasets from scratch. According to the legend, if you climb San Bruno mountain on a cold winter night, you can hear them deep within the valley…typing away. semafor.com/article/01/27/…
A common criticism of LLM watermarks is they can be removed by AI paraphrasing or human editing. Let's put this theory to the test! Can a watermark be automatically removed by GPT? Can a grad student do any better? The results surprised me 🧵 arxiv.org/pdf/2306.04634…
First, if you don’t remember how watermarks work, you might revisit my original post on this issue.
TL;DR The watermark is a subtle pattern embedded in LLM outputs that labels them as machine-generated. High-accuracy detection usually requires 50-ish words.
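For the curious, detection boils down to a one-line z-test on the count of “green list” tokens (gamma is the fraction of the vocabulary placed on the green list):

```python
import math

def watermark_z_score(num_green: int, total_tokens: int, gamma: float = 0.5) -> float:
    """How far the observed green-token count exceeds what
    unwatermarked text would produce by chance."""
    expected = gamma * total_tokens
    variance = total_tokens * gamma * (1 - gamma)
    return (num_green - expected) / math.sqrt(variance)

# With ~50 tokens, a modest green-token excess is already detectable:
print(watermark_z_score(35, 50))  # ≈ 2.83, i.e. p < 0.003
```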
The experiment: We generated watermarked text using the Llama model, then asked a non-watermarked LLM (GPT-3.5) to re-write it. We did lots of prompt engineering to try to get rid of the watermark. Finally, we checked whether we could detect the watermark in the rewritten text.
LLMs do many things more efficiently than humans. But there’s one thing humans still do WAY better than machines: learn. In this thread I compare the learning efficiency of machines to that of humans, and I use scaling laws to convert humans into equivalent LLMs. 🧵
A typical human hears 20K words per day. By age five, a typical child should have heard 37 million words; a 50-year-old should have heard 370M words. greatschools.org/gk/articles/wo…
Let’s compare that to an LLM. Meta’s Llama model is proficient in English and elementary math. Llama was trained on 1.4 trillion tokens. That’s 3,800 times more tokens than a human has verbally exchanged by age 50.
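The back-of-envelope arithmetic, treating words and tokens as roughly interchangeable:

```python
words_per_day = 20_000
human_50yr = words_per_day * 365 * 50   # ≈ 3.7e8 words heard by age 50
llama_tokens = 1.4e12                   # Llama's training set
print(llama_tokens / human_50yr)        # ≈ 3,800x
```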
It is widely thought that neural networks generalize because of implicit regularization of gradient descent. Today at #ICLR2023 we show new evidence to the contrary. We train with gradient-free optimizers and observe generalization competitive with SGD. openreview.net/forum?id=QC10R…
An alternative theory of generalization is the "volume hypothesis": Good minima are flat, and occupy more volume than bad minima. For this reason, optimizers are more likely to land in the large/wide basins around good minima, and less likely to land in small/sharp bad minima.
One of the optimizers we test is a “guess and check” (GAC) optimizer that samples parameters uniformly from a hypercube and checks whether the loss is low. If so, optimization terminates. If not, it throws away the parameters and samples again until it finds a low loss.
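If that sounds too simple to work, here’s a minimal sketch of the GAC idea (loss_fn, the hypercube bound, and the target loss are placeholders; this is only feasible for very small networks):

```python
import numpy as np

def guess_and_check(loss_fn, dim, bound=1.0, target=0.01, max_tries=10**6):
    """Sample parameter vectors uniformly from [-bound, bound]^dim and
    stop at the first low-loss draw. No gradients anywhere, so any
    generalization must come from the volume of good minima, not from
    SGD dynamics."""
    rng = np.random.default_rng(0)
    for _ in range(max_tries):
        theta = rng.uniform(-bound, bound, size=dim)
        if loss_fn(theta) < target:
            return theta  # landed in a (probably wide) low-loss basin
    raise RuntimeError("no low-loss parameters found")
```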
Here's the real story of #SiliconValleyBank, as told the boring way through tedious analysis of balance sheets and SEC filings 🧵
Throughout 2021, startups were raising money from VCs and stashing it in SVB. Deposits increased from $102B to $189B. That's an 85% increase in one year. Wow!
Most news sources claim that SVB stashed this money in relatively safe treasury securities. This is something important that most media sources got wrong. forbes.com/sites/billcone…
If you work for a US university, you have probably noticed the rollout of strict new policies mandating disclosures and approvals for funding, consulting, and COIs, and also threats of legal action for non-compliance. Here’s why this is happening now 🧵
Let's start at the beginning. In 2018, the DOJ implemented its new “China Policy.” The stated purpose of this program was to combat perceived Chinese espionage operations inside US universities. fbi.gov/investigate/co…
In practice, the DOJ used the policy to investigate people of Chinese descent, usually without evidence of espionage. Many people were arrested and jailed with no formal charges at all. reuters.com/world/us/trump…
We rack our brains making prompts for #StableDiffusion and Language Models. But a lot of prompt engineering can be done *automatically* using simple gradient-based optimization. And the cold calculating efficiency of the machine crushes human creativity.
Prompts made easy (PEZ) is a gradient optimizer for text. It can convert images into prompts for Stable Diffusion, or it can learn a hard prompt for an LLM task. The method uses ideas from the binary neural nets literature that mash up continuous and discrete optimization.
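A simplified sketch of the core trick (not the full method): optimize a continuous “soft” prompt, but take gradients through its nearest-token projection so the result stays a real, readable prompt. Here loss_fn (e.g. a CLIP similarity) is a placeholder:

```python
import torch

def pez_step(soft_embeds, embed_table, loss_fn, lr=0.1):
    """One simplified step. soft_embeds: (seq_len, d) learnable tensor;
    embed_table: (vocab, d) frozen token embeddings."""
    # Project each soft embedding to its nearest real token embedding
    dists = torch.cdist(soft_embeds, embed_table)   # (seq_len, vocab)
    token_ids = dists.argmin(dim=-1)
    hard_embeds = embed_table[token_ids].detach().requires_grad_(True)
    loss = loss_fn(hard_embeds)   # evaluate loss at the DISCRETE point
    loss.backward()
    # ...but apply the gradient to the CONTINUOUS point
    with torch.no_grad():
        soft_embeds -= lr * hard_embeds.grad
    return token_ids              # the readable prompt so far
```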
PEZ can even create a prompt to represent a face...as the hypothetical offspring of multiple celebrities ¯\_(ツ)_/¯