Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.

Now you do. Now everyone does.

Presenting "books3", aka "all of bibliotik"

- 196,640 books
- in plain .txt
- reliable, direct download, for years:…

thread 👇
I wrote up some details here:…

In OpenAI's papers on GPT-2 and 3, you'll notice references to datasets named "books1" and "books2".

books1 appears to be bookcorpus, or similar.

But OpenAI will not release information about books2; a crucial mystery.
We suspect OpenAI's books2 dataset might be "all of libgen", but no one knows. It's all pure conjecture.

Nonetheless, books3, released above, is "all of bibliotik", which I imagine will be of interest to anyone doing NLP work. Or anyone who wants to read 196,640 books. :)
You now have OpenAI-grade training data at your fingertips. books3 is a direct download of 196,640 plaintext books, preprocessed via ftfy.fix_text(), as OpenAI did.
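
For anyone who hasn't used ftfy before, here is a minimal sketch of what that kind of preprocessing step could look like on a single book. The `clean_book` helper and the sample string are hypothetical, not the script used to build books3. ftfy is a third-party package; if it isn't installed, the sketch falls back to plain Unicode NFC normalization, which is an assumption made only so the example still runs and is not equivalent to what ftfy does.

```python
import unicodedata

try:
    import ftfy  # third-party package: pip install ftfy
except ImportError:
    ftfy = None  # assumption: degrade gracefully when ftfy is unavailable


def clean_book(text: str) -> str:
    """Repair mojibake and similar encoding damage in one book's text.

    Hypothetical helper for illustration only.
    """
    if ftfy is not None:
        # fix_text() repairs common encoding mix-ups (e.g. UTF-8 decoded as Windows-1252).
        return ftfy.fix_text(text)
    # Fallback: only normalizes Unicode composition; it does NOT repair mojibake.
    return unicodedata.normalize("NFC", text)


# A made-up line with a typical mis-decoded apostrophe.
sample = "The Mona Lisa doesnâ€™t have eyebrows."
print(clean_book(sample))
```

Clean ASCII text passes through unchanged, so running this over already-clean books should be harmless.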

More importantly, we anticipate that the download link will remain stable, possibly for years to come.
This stable link is thanks to a community of data enthusiasts: they gather data and make it available "for the benefit of humankind".

They have an interesting DMCA policy. I urge you to read it:
(Did you read their DMCA policy yet? I almost didn't bother, so I'd better call it out twice to make sure you don't miss the crucial details. Please take a moment to review it; you'll find it well worth your time.)
I can't stress enough how unbelievably impactful this feels. Imagine finally not having to worry about "how do I get imagenet?" or "Where did Celeb-A go?" or "I want to make GPT-3; where is the data?"

In hindsight, a community of data hoarders was a natural fit.
Now, 200k books is all well and good, but what if you're not happy with it? What if you want more?

What if you want to make a GPT model that knows literally everything about programming?

answer: train on "literally all of github".

A 100GB github.tar isn't quite "all of github." That's not scientifically accurate to claim. But "all of github" wouldn't necessarily be much more useful than what you get there. 100GB compressed is a pretty good chunk.

Something like 12GB of C++, 7GB of .py, 4GB of .js.
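
If you want to check a breakdown like that yourself, here is a rough sketch that tallies bytes per file extension inside a tar archive using only the standard library. The `bytes_by_extension` helper and the tiny in-memory demo archive are hypothetical; the real archive's layout is not described here, so treat this as an illustration under that assumption.

```python
import io
import os
import tarfile
from collections import Counter


def bytes_by_extension(tar: tarfile.TarFile) -> Counter:
    """Sum uncompressed file sizes per extension (hypothetical helper)."""
    totals = Counter()
    for member in tar:  # iterating a TarFile yields TarInfo objects
        if member.isfile():
            ext = os.path.splitext(member.name)[1].lower() or "(none)"
            totals[ext] += member.size
    return totals


def _demo_archive() -> io.BytesIO:
    """Build a small in-memory tar so the sketch runs without any download."""
    files = [
        ("a/main.cpp", b"x" * 30),
        ("a/util.py", b"y" * 20),
        ("b/app.js", b"z" * 10),
        ("b/more.py", b"y" * 5),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=_demo_archive(), mode="r") as tar:
    totals = bytes_by_extension(tar)

# Largest extensions first.
for ext, size in totals.most_common():
    print(f"{ext}\t{size} bytes")
```

For a real archive on disk, opening it with `tarfile.open(path, mode="r|*")` streams through it once without seeking, which tends to matter at the 100GB scale.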
There's more to say, but this is enough for now. I never know ahead of time whether any given tweet will do well, so this is always a lottery. It *feels* like I should be shouting to everyone about this – who wouldn't want 200,000 books? – but who knows.
Spread it like 🔥🔥


ML discord (~850 members):
I made a script to let you mount books3 remotely. It only takes a few minutes to run, and uses less than 1GB:

Now you don’t need a big hard drive or lots of patience just to poke around the dataset. I think you can start looking at books in < 10 min.
