Cameron R. Wolfe, Ph.D.
Research @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable

Dec 19, 2022, 8 tweets

After GPT-3 was proposed, a lot of research went into building an even better language model. Initial attempts focused simply on training larger models. Contrary to popular belief, however, there is more to creating a good language model than size… 🧵[1/8]

As proof, we have MT-NLG, a 530-billion-parameter language model proposed after GPT-3. Training this model was a massive engineering effort. Some tasks benefit more than others, but this much bigger model did not significantly improve upon GPT-3! [2/8]

In contrast, using a pre-training corpus that is larger, higher-quality, and more diverse can yield significant performance benefits. This was shown by Gopher, which improved performance via a larger model and a better pre-training corpus called MassiveText. [3/8]

Initially, we thought that model size mattered more than the amount of pre-training data, but we later found that model and data size are equally important when scaling up language model pre-training. Most language models were too big and trained on too little data. [4/8]
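
To make "equally important" concrete, here is a minimal sketch of the compute-optimal rule of thumb popularized by the Chinchilla work: training compute is roughly C ≈ 6·N·D FLOPs (N = parameters, D = tokens), and for a fixed budget both should grow at about the same rate, landing near ~20 tokens per parameter. The constant and the helper function below are illustrative assumptions, not exact values from the paper.

```python
# Rough Chinchilla-style rule of thumb (illustrative, not the paper's exact fit).
# Training compute is approximately C = 6 * N * D FLOPs, where
# N = parameter count and D = number of training tokens.

def compute_optimal_allocation(compute_budget_flops, tokens_per_param=20):
    """Split a FLOP budget between parameters (N) and tokens (D),
    assuming the compute-optimal ratio D ≈ 20 * N."""
    # C = 6 * N * D and D = tokens_per_param * N
    # => N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_budget_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a Gopher-scale budget (~5.8e23 FLOPs) comes out to roughly a
# 70B-parameter model trained on ~1.4T tokens.
n, d = compute_optimal_allocation(5.8e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```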

To test this hypothesis, researchers proposed Chinchilla, a 70 billion parameter language model. Although Chinchilla is smaller than popular models like Gopher and GPT-3, it can surpass their performance via more extensive pre-training! [5/8]
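
A quick back-of-the-envelope check (using the same C ≈ 6·N·D approximation and the publicly reported figures of ~280B params / ~300B tokens for Gopher and ~70B params / ~1.4T tokens for Chinchilla) shows the two models cost roughly the same compute to train; Chinchilla just spends that compute on more data with a smaller model.

```python
# Back-of-the-envelope training compute via C ≈ 6 * N * D (FLOPs).
# Public figures: Gopher ≈ 280B params / 300B tokens,
# Chinchilla ≈ 70B params / 1.4T tokens.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # ≈ 5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)   # ≈ 5.9e23 FLOPs

print(f"Gopher:     {gopher:.1e} FLOPs")
print(f"Chinchilla: {chinchilla:.1e} FLOPs")
# Similar compute budgets, but Chinchilla uses a smaller model trained on
# far more tokens -- and ends up performing better.
```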

Beyond model and data size, researchers have shown that language models have an optimal depth and width that changes depending on parameter count. Overall, these findings reveal that (slight) performance benefits are realized with shallower, wider language models. [6/8]
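
For intuition on what the depth/width trade-off means, here is a minimal sketch using the common approximation that each transformer block contributes roughly 12·d_model² non-embedding parameters (attention plus a 4x MLP). The two configurations below are hypothetical, chosen only to show that the same parameter budget can be spent deep-and-narrow or shallow-and-wide.

```python
# Rough transformer parameter count (ignoring embeddings):
# each block has ~4*d^2 attention params + ~8*d^2 MLP params = ~12*d^2.
def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

# Two hypothetical configurations with a similar parameter budget
# but different shapes:
deep_narrow = approx_params(n_layers=48, d_model=1600)   # ~1.5B params
shallow_wide = approx_params(n_layers=24, d_model=2304)  # ~1.5B params

print(f"deep/narrow:  {deep_narrow / 1e9:.2f}B params")
print(f"shallow/wide: {shallow_wide / 1e9:.2f}B params")
```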

So, better language models seem to be achieved via:

1. Larger models
2. More data
3. The correct architecture

There are probably more factors, but just making the model bigger is sub-optimal and requires a lot of compute/engineering!

arxiv.org/abs/2211.02001

[7/8]

For more details, check out the most recent edition of my newsletter. It studies modern language models like MT-NLG, Gopher, and Chinchilla.

bit.ly/3VcNJb7

These models (and others) have had a massive impact that will continue to expand over time! [8/8]
