Tom Goldstein
Jun 19 · 12 tweets · 4 min read
Training an LLM takes about 1 trillion words. That’s about 30,000 years of typing.
But where does this data come from?
And what does this have to do with the Reddit protests?
Here’s how OpenAI trains models on “the entire internet.” 🧵📜
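As a sanity check on the opening claim, here is the arithmetic behind "30,000 years of typing," assuming a hypothetical nonstop typist at 60 words per minute:

```python
WORDS = 1_000_000_000_000  # ~1 trillion training words (the thread's figure)
WPM = 60                   # assumed typing speed: 60 words/minute, nonstop

minutes = WORDS / WPM
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years")  # roughly 30,000 years of continuous typing
```

At 60 wpm this comes out to about 31,700 years, so "about 30,000 years" is the right order of magnitude.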
Much of what we know about OpenAI is from urban legends. But the GPT-3 paper does have a table showing their data sources. The cliché that LLMs are trained on “the whole internet” comes from the use of CommonCrawl.
CommonCrawl (CC) is a non-profit that has been scraping the internet with bots since 2008, trying to record everything. 90% of CC is HTML, CSS, and scripts. Even the usable 10% is full of junk that has to be tossed out to clean the dataset.
CC is too big to clean by hand, but there are algorithmic hacks to remove garbage. Google’s C4 dataset is an auto-cleaned CC. Here are the top sources. It contains a lot of straight news, but also some Breitbart, RT, and a tiny bit of Daily Stormer. washingtonpost.com/technology/int…
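To give a flavor of what "algorithmic hacks" means here, below is a minimal sketch of a few C4-style heuristics (keep lines that end in terminal punctuation and have at least a few words, drop pages containing boilerplate markers). The marker list and thresholds are illustrative, not the exact C4 rules:

```python
import re

# Markers that, C4-style, cause an entire page to be discarded (illustrative list)
BAD_MARKERS = ("lorem ipsum", "{")

def clean_page(text):
    """Apply a few C4-style heuristics to a scraped page.

    Keeps lines that have >= 3 words and end in terminal punctuation;
    drops the whole page if it contains a boilerplate marker or if too
    little prose survives the line filter.
    """
    lower = text.lower()
    if any(m in lower for m in BAD_MARKERS):
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) >= 3 and re.search(r'[.!?"]$', line):
            kept.append(line)
    if len(kept) < 3:  # too little real prose left -> toss the page
        return None
    return "\n".join(kept)
```

Crude rules like these scale to petabytes where human review can't, which is exactly why junk like the sources above still slips through.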
It's way better to clean data with humans instead of machines, but this costs a fortune. What if I told you people will rate your data samples for free? Now that's truly the stuff of legends. This is where Reddit comes in.
OpenAI’s WebText dataset was created by harvesting hyperlinks from Reddit posts, but only those whose posts received 3+ upvotes. The result is mostly high-quality news articles and blog posts. OpenAI never distributed WebText because it contains copyrighted articles.
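The WebText selection rule is simple enough to sketch: collect outbound links from Reddit posts and keep only those whose post cleared the karma threshold, so the upvotes act as free human quality filtering. The `posts` records below are hypothetical; the real pipeline processed a full Reddit dump:

```python
# Hypothetical Reddit post records: (outbound URL, post score)
posts = [
    {"url": "https://example.com/article", "score": 12},
    {"url": "https://example.com/spam", "score": 1},
    {"url": "https://example.com/blog-post", "score": 3},
]

KARMA_THRESHOLD = 3  # WebText kept links from posts with 3+ upvotes

webtext_urls = {p["url"] for p in posts if p["score"] >= KARMA_THRESHOLD}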
OpenAI has used Reddit before. For example, TL;DR summaries from Reddit were used to train older models. OpenAI’s frequent use of Reddit has been cited as one reason Reddit recently started charging for data access, sparking protests from its users.
nytimes.com/2023/04/18/tec…
Finally, many models use Wikipedia, ArXiv, etc. These release periodic data dumps to disincentivize scraping. Some are already in CC, but they are often included again and up-weighted because of their quality. As huge as they may seem, each is <1% of a cleaned CC.
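In practice, "up-weighting" means sampling training batches from small, high-quality sources far more often than their raw size would suggest. A minimal sketch, using mixing weights roughly like those reported in the GPT-3 paper (the exact proportions here are illustrative):

```python
import random

# Illustrative mixing weights, roughly in the spirit of the GPT-3 paper:
# Wikipedia is <1% of the raw data but ~3% of training batches.
mixture = {
    "CommonCrawl": 0.60,
    "WebText2":    0.22,
    "Books1":      0.08,
    "Books2":      0.08,
    "Wikipedia":   0.03,
}

def sample_source(rng):
    """Pick which dataset the next training batch comes from."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

Over many batches, the model effectively "sees" Wikipedia several times per epoch of CommonCrawl, which is the up-weighting described above.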
OpenAI’s most mysterious assets are “books.” Legends are all we have for these. Is Books2 from the (illegal?) LibGen server, which explains OpenAI’s silence on the issue? Is Books1 from Smashwords/Project Gutenberg? Maybe they’re written by gnomes. Who knows 🤷
And that’s it. The whole internet. Well, at least the publicly accessible, text-based internet. So where did OpenAI go from here?

There are sources of data that are not in CC, either because they are paywalled or because they are not text.
One legend is that OpenAI generated transcripts of top YouTube videos when they constructed the Whisper dataset. YouTube has millions of hours of high-quality video, containing billions of tokens.
openai.com/research/whisp…
The most recent OpenAI legend is that they hired an army of writers and code developers to make datasets from scratch. According to the legend, if you climb San Bruno mountain on a cold winter night, you can hear them deep within the valley…typing away.
semafor.com/article/01/27/…
