In OpenAI's papers on GPT-2 and GPT-3, you'll notice references to datasets named "books1" and "books2".
books1 appears to be bookcorpus, or similar.
But OpenAI won't release any information about books2, which remains a crucial mystery.
We suspect OpenAI's books2 dataset might be "all of libgen", but no one knows. It's all pure conjecture.
Nonetheless, books3, released above, is "all of bibliotik", which I imagine will be of interest to anyone doing NLP work. Or anyone who wants to read 196,640 books. :)
You now have OpenAI-grade training data at your fingertips. books3 is a direct download of 200k plaintext books, preprocessed via ftfy.fix_text(), as OpenAI did.
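If you want to run the same cleanup over your own text, a minimal sketch looks like this (the directory names are placeholders, and OpenAI's exact settings aren't published; ftfy.fix_text() is the actual call):

```python
# Minimal sketch: normalize raw .txt files with ftfy, the same preprocessing step books3 went through.
# The books3_raw / books3_clean paths are made-up placeholders.
import os
import ftfy

def clean_book(src_path, dst_path):
    with open(src_path, encoding="utf-8", errors="ignore") as f:
        raw = f.read()
    fixed = ftfy.fix_text(raw)  # repairs mojibake and normalizes unicode
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write(fixed)

os.makedirs("books3_clean", exist_ok=True)
for root, _, files in os.walk("books3_raw"):
    for name in files:
        if name.endswith(".txt"):
            clean_book(os.path.join(root, name), os.path.join("books3_clean", name))
```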
More importantly, we anticipate that the download link will remain stable, possibly for years to come.
This stable link is thanks to a community of data enthusiasts: the-eye.eu. They gather and make available data "for the benefit of humankind".
They have an interesting DMCA policy. I urge you to read it: the-eye.eu/dmca.mp4
(Did you read their DMCA policy yet? the-eye.eu/dmca.mp4 I almost didn't bother, so I'd better call it out twice to make sure you don't miss the crucial details. Please take a moment to review it; you'll find it well worth your time.)
I can't stress enough how unbelievably impactful this feels. Imagine finally not having to worry about "how do I get imagenet?" or "Where did Celeb-A go?" or "I want to make GPT-3; where is the data?"
In hindsight, a community of data hoarders was a natural fit.
Now, 200k books is all well and good, but what if you're not happy with it? What if you want more?
What if you want to make a GPT model that knows literally everything about programming?
A 100GB github.tar isn't quite "all of github", and it wouldn't be accurate to claim that it is. But "all of github" wouldn't necessarily be much more useful than what you get here: 100GB compressed is a pretty good chunk.
Something like 12GB of C++, 7GB of .py, 4GB of .js.
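If you want to sanity-check those numbers yourself, here's a rough sketch of tallying bytes per file extension in a tarball (github.tar is just the name used above; nothing about the archive layout is guaranteed):

```python
# Rough sketch: tally uncompressed bytes per file extension in a tar archive.
# "github.tar" is simply the filename mentioned above; point this at whatever you downloaded.
import os
import tarfile
from collections import Counter

sizes = Counter()
with tarfile.open("github.tar") as tar:
    for member in tar:
        if member.isfile():
            ext = os.path.splitext(member.name)[1] or "<no extension>"
            sizes[ext] += member.size

for ext, total in sizes.most_common(20):
    print(f"{ext:>15}  {total / 1e9:6.2f} GB")
```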
There's more to say, but this is enough for now. I never know ahead of time whether any given tweet will do well, so this is always a lottery. It *feels* like I should be shouting to everyone about this – who wouldn't want 200,000 books? – but who knows.
lol. So, we're doing some image processing with TPUs. We want to save the results directly to our cloud bucket, rather than having them transmitted to our VM, saved locally, and then uploaded to the bucket. Got a funny idea...
I guess this will be a ramble:
TPUs support a limited set of operations. But what you get in exchange is blazing speed.
A TPU consists of 8 cores, plus a CPU. (Yes, the TPU has a CPU -- weird concept, but think of it like a big computer with 8 GPUs. Obviously, a computer with GPUs has a CPU.)
In the same way that GPUs are much more restrictive than CPUs – it's a lot easier to write programs for CPUs than GPUs! – the TPU cores are much more restrictive than the TPU's CPU.
But that's a good thing: it means you get some nice flexibility from the TPU's CPU.
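As a concrete example of the "skip the local disk" part: TensorFlow's gfile layer can write straight to gs:// paths, so results coming off the TPU can land in the bucket without ever being saved locally. A minimal sketch (the bucket path and the stand-in array are made up):

```python
# Minimal sketch: write a result straight to a GCS bucket via tf.io.gfile,
# skipping the "copy to VM, save locally, upload" round trip.
# The gs:// path and the fake result array are placeholders.
import numpy as np
import tensorflow as tf

result = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for an image produced on the TPU

with tf.io.gfile.GFile("gs://my-bucket/outputs/result_0000.npy", "wb") as f:
    np.save(f, result)
```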
(3.51 minutes for a v3-512 is slightly faster than their posted result of 3.85 minutes, too!)
This raises a question: *why* is the official benchmark so blazingly fast? That's about 930 examples/sec per core. When I tried to write my own code, I could only get 250 examples/sec per core. Are they cheating? *gasp*
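To put that gap in absolute terms (taking a v3-512 to mean 512 cores), a quick back-of-the-envelope:

```python
# Back-of-the-envelope from the per-core numbers above, assuming a v3-512 has 512 cores.
cores = 512
official = 930 * cores  # ~476,000 examples/sec for the official benchmark
mine = 250 * cores      # ~128,000 examples/sec for my code
print(official, mine, round(official / mine, 1))  # roughly a 3.7x gap
```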