Davis Blalock
Research scientist + first hire @MosaicML. @MIT PhD. I write + retweet threads about machine learning papers. Paper summaries newsletter: https://t.co/xX7NIpsIVZ

Aug 27, 2022, 15 tweets

"Understanding Scaling Laws for Recommendation Models"

For two years, the AI world has had this glorious period of believing that big tech companies just need more compute to make their models better, not more user data.

That period is ending. Here's what happened: [1/14]

In 2020, OpenAI published a paper (arxiv.org/abs/2001.08361) assessing the relative effects of scaling up models vs datasets. They found that scaling up models had *way* higher returns. [2/14]

The party was on. We got libraries like DeepSpeed (github.com/microsoft/Deep…) that let you train huge models across countless GPUs. We got trillion-parameter… [3/14]

…Mixture of Experts models (arxiv.org/abs/2101.03961). We talked about the "infinite data regime" because we weren't even bothering to use all the data.

Parameter counts were the headline and sample counts were buried in the results section. [4/14]

Fast-forward to March 2022. DeepMind releases the Chinchilla paper (arxiv.org/abs/2203.15556), which shows that a subtle issue with the OpenAI paper (the learning-rate schedule wasn't tuned to each run's token budget) caused it to vastly underestimate the importance of dataset size. [5/14]

With smaller models and more data, the Chinchilla authors got much better results for a fixed compute budget. [6/14]
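
For a rough sense of the numbers (these figures are from the Chinchilla paper, not the recommender paper, using the standard C ≈ 6ND FLOP estimate): Chinchilla trained a 70B-parameter model on 1.4T tokens and beat the 280B-parameter Gopher, which saw only 300B tokens, at a comparable compute budget.

```latex
\[
C \approx 6ND:\qquad
\underbrace{6 \cdot (280\times10^{9}) \cdot (300\times10^{9})}_{\text{Gopher}} \approx 5.0\times10^{23}
\quad\text{vs}\quad
\underbrace{6 \cdot (70\times10^{9}) \cdot (1.4\times10^{12})}_{\text{Chinchilla}} \approx 5.9\times10^{23}\ \text{FLOPs}
\]
```

Same compute, roughly 4x fewer parameters and ~4.7x more tokens; the compute-optimal rule of thumb that fell out of this is roughly 20 training tokens per parameter.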

Moreover, as one well-known commentary pointed out (lesswrong.com/posts/6Fpvch8R…), the Chinchilla scaling formula suggests that there's a *hard limit* for model accuracy that no amount of model size will ever overcome without more data. [7/14]
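
Concretely, the Chinchilla parametric fit writes the loss as an irreducible term plus two power-law terms in model size N and data size D (the constants below are the commonly cited approximate fits for language modeling; treat them as illustrative):

```latex
\[
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
\]
\[
\lim_{N \to \infty} L(N, D) \;=\; E + \frac{B}{D^{\beta}}
\]
```

Send N to infinity and the loss still floors out at E + B/D^β; only more data lowers that floor.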

But all of the above work focused on language models.

The real moneymakers at big tech companies are recommender systems. These are what populate your feeds, choose what ads you see, etc.

Maybe language models need more data, but recommender systems don't? [8/14]

This brings us to the current paper, which studies scaling of recommender models.

Put simply: “We show that parameter scaling is out of steam...and until a higher-performing model architecture emerges, data scaling is the path forward.” [9/14]

In more detail, they conduct a thorough study of click-through rate prediction models, the workhorse of targeted ads.

They really seem to have tried to get model scaling to work, more so than any similar paper I've seen. [10/14]

E.g., they break the model into four components and dig deep into how best to scale each one as model and dataset size grow.

But even the best-chosen model scaling isn't as good as data scaling. [11/14]

Also, similar to language model work, they find clear power laws. These mean that you need a *multiplicative* increase in data and compute to eliminate a fixed fraction of the errors.

I.e., the need for data + compute is nearly insatiable. [12/14]
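
To make that concrete, here's a minimal sketch (mine, not the paper's): if the reducible error falls as D^(-β), then shrinking it by a fixed factor requires multiplying the data by that factor raised to 1/β. The β = 0.28 below is borrowed from the language-model setting purely for illustration.

```python
# Illustrative sketch of power-law scaling, not code from the paper.
# Assumes reducible error ~ D**(-beta); beta = 0.28 is an example exponent.
def data_multiplier(error_reduction: float, beta: float) -> float:
    """Factor by which the dataset must grow to shrink the reducible error
    by `error_reduction` (e.g. 2.0 means 'cut the error in half')."""
    return error_reduction ** (1.0 / beta)

beta = 0.28
for k in (2, 4, 10):
    print(f"cut error by {k}x -> ~{data_multiplier(k, beta):,.0f}x more data")

# cut error by 2x  -> ~12x more data
# cut error by 4x  -> ~141x more data
# cut error by 10x -> ~3,728x more data
```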

I wish they *hadn't* found that recommender models need way more data—especially since recommender data is all about tracking what users see and click on.

But if that's the reality, I'm glad this information is at least shared openly via a well-executed paper. [13/14]

Speaking of which, here's the paper: bit.ly/3ARAjKI

And here's my more detailed synopsis: (dblalock.substack.com/i/69736655/und…) [14/14]

If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @newsha_a @CarolejeanWu @b_bhushanam.

For more threads like this, follow me or @MosaicML

As always, comments + corrections welcome!
