Davis Blalock
Aug 27 · 15 tweets · 6 min read
"Understanding Scaling Laws for Recommendation Models"

For two years, the AI world has had this glorious period of believing that big tech companies just need more compute to make their models better, not more user data.

That period is ending. Here's what happened: [1/14]
In 2020, OpenAI published a paper (arxiv.org/abs/2001.08361) assessing the relative effects of scaling up models vs datasets. They found that scaling up models had *way* higher returns. [2/14]
The party was on. We got libraries like DeepSpeed (github.com/microsoft/Deep…) that let you train huge models across countless GPUs. We got trillion-parameter… [3/14]
…Mixture of Experts models (arxiv.org/abs/2101.03961). We talked about the "infinite data regime" because we weren't even bothering to use all the data.

Parameter counts were the headline and sample counts were buried in the results section. [4/14]
Fast-forward to March 2022. DeepMind releases the Chinchilla paper (arxiv.org/abs/2203.15556), which shows that a subtle issue with the OpenAI paper caused it to vastly underestimate the importance of dataset size. [5/14]
With smaller models and more data, the Chinchilla authors got much better results for a fixed compute budget. [6/14]
Moreover, as one well-known commentary pointed out (lesswrong.com/posts/6Fpvch8R…), the Chinchilla scaling formula suggests that there's a *hard limit* for model accuracy that no amount of model size will ever overcome without more data. [7/14]
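To make that hard limit concrete, here's a minimal sketch of the formula. The functional form L(N, D) = E + A/N^alpha + B/D^beta is from the Chinchilla paper; the constants below are its reported language-model fits, so treat the exact numbers as approximate:

```python
# Chinchilla loss formula: L(N, D) = E + A / N**alpha + B / D**beta
# N = parameter count, D = training tokens. Constants are the paper's
# reported fits for language models (approximate).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    return E + A / N**alpha + B / D**beta

D = 300e9  # fix the dataset at ~300B tokens (illustrative)
for N in (1e9, 1e10, 1e11, 1e12):
    print(f"N={N:.0e}: L={loss(N, D):.3f}")

# As N grows, A / N**alpha -> 0, so the loss can never drop below this
# data-determined floor, no matter how big the model gets:
print(f"floor: {E + B / D**beta:.3f}")
```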
But all of the above work focused on language models.

The real moneymakers at big tech companies are recommender systems. These are what populate your feeds, choose what ads you see, etc.

Maybe language models need more data, but recommender systems don't? [8/14]
This brings us to the current paper, which studies scaling of recommender models.

Put simply: “We show that parameter scaling is out of steam...and until a higher-performing model architecture emerges, data scaling is the path forward.” [9/14]
In more detail, they conduct a thorough study of click-through rate prediction models, the workhorse of targeted ads.

They really seem to have tried to get model scaling to work, more so than any similar paper I've seen. [10/14]
E.g., they divide models into four components and dig deep into how to scale up each one as a function of model and dataset size.

But even the best-chosen model scaling isn't as good as data scaling. [11/14]
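(If you haven't worked with these models: a CTR predictor typically combines per-feature embedding tables with dense MLPs and a feature-interaction step. Below is a generic, DLRM-style sketch to make the pieces concrete — an illustrative toy, not the paper's exact architecture or its four-way component breakdown.)

```python
import torch
import torch.nn as nn

class TinyCTRModel(nn.Module):
    """Toy DLRM-style click-through-rate model (illustrative only)."""
    def __init__(self, n_sparse=8, vocab=10_000, dim=16, n_dense=13):
        super().__init__()
        # Embedding tables: one per sparse (categorical) feature
        self.embeddings = nn.ModuleList(
            nn.Embedding(vocab, dim) for _ in range(n_sparse))
        # Bottom MLP: projects dense (numeric) features into the same space
        self.bottom_mlp = nn.Sequential(nn.Linear(n_dense, dim), nn.ReLU())
        # Top MLP: maps the interacted features to a click probability
        self.top_mlp = nn.Sequential(
            nn.Linear(dim * (n_sparse + 1), 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dense, sparse):
        feats = [emb(sparse[:, i]) for i, emb in enumerate(self.embeddings)]
        feats.append(self.bottom_mlp(dense))
        # Interaction here is plain concatenation; real models often use
        # pairwise dot products or richer interaction modules
        return torch.sigmoid(self.top_mlp(torch.cat(feats, dim=1)))

model = TinyCTRModel()
p_click = model(torch.randn(32, 13), torch.randint(0, 10_000, (32, 8)))
```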
Also, similar to language model work, they find clear power laws. These mean that you need a *multiplicative* increase in data and compute to eliminate a fixed fraction of the errors.

I.e., the need for data + compute is nearly insatiable. [12/14]
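A quick back-of-the-envelope for what "multiplicative" means here, assuming the reducible error falls as a power law in dataset size:

```python
# If reducible error scales as D**(-beta), then halving it requires
# multiplying the dataset by 2**(1/beta) -- every single time.
# beta = 0.28 is illustrative, in the range these papers report.
beta = 0.28
print(f"data multiplier per error halving: {2 ** (1 / beta):.1f}x")
# -> ~11.9x more data each time you want half the remaining error
```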
I wish they *hadn't* found that recommender models need way more data—especially since recommender data is all about tracking what users see and click on.

But if that's the reality, I'm glad this information is at least shared openly via a well-executed paper. [13/14]
Speaking of which, here's the paper: bit.ly/3ARAjKI

And here's my more detailed synopsis: (dblalock.substack.com/i/69736655/und…) [14/14]
If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @newsha_a @CarolejeanWu @b_bhushanam.

For more threads like this, follow me or @MosaicML

As always, comments + corrections welcome!

More from @davisblalock

Aug 25
"No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects"

Instead of using a pooling layer or having a stride for your conv, just use a space-to-depth op followed by a non-strided conv. [1/8]
This substitution seems to be an improvement. [2/8]
This is especially true for small models and when detecting small objects. Most importantly, these improvements seem to hold even when conditioning on single-image inference latency. This is important because it's easy to do "better" when you're slower. [3/8]
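Here's a minimal PyTorch sketch of the substitution (my paraphrase of the building block, not the authors' code). nn.PixelUnshuffle is PyTorch's space-to-depth op:

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided conv (sketch)."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        # Rearranges each 2x2 spatial block into channels:
        # (C, H, W) -> (4C, H/2, W/2), losing no information
        self.space_to_depth = nn.PixelUnshuffle(scale)
        # Stride-1 conv does the mixing; downsampling already happened
        self.conv = nn.Conv2d(in_ch * scale**2, out_ch, 3, stride=1, padding=1)

    def forward(self, x):
        return self.conv(self.space_to_depth(x))

x = torch.randn(1, 64, 32, 32)
y = SPDConv(64, 128)(x)  # (1, 128, 16, 16): same downsampling, no stride/pool
```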
Aug 23
"Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models"

Another optimizer paper attempting to descend through a crowded valley to beat Adam. But...maybe this one actually does? [1/11]
Their update equation is fairly straightforward, and complements the gradient momentum term with a difference-of-gradients momentum term. [2/11]
It does have an extra hyperparameter compared to Adam (β3), but they hardcode it to 0.08 in all their experiments, so it’s apparently not important to tune. [3/11]
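A rough sketch of that structure (my paraphrase, not the paper's exact update rule or constants): gradient momentum plus a momentum term on the gradient *difference*, rescaled Adam-style by a second moment.

```python
import numpy as np

def adan_like_step(theta, g, g_prev, state, lr=1e-3,
                   b1=0.98, b2=0.92, b3=0.99, eps=1e-8):
    """One illustrative step; names and constants are placeholders."""
    m, v, n = state
    diff = g - g_prev
    m = b1 * m + (1 - b1) * g        # EMA of gradients (momentum)
    v = b2 * v + (1 - b2) * diff     # EMA of gradient differences
    u = g + b2 * diff                # Nesterov-flavored combined direction
    n = b3 * n + (1 - b3) * u * u    # EMA of squared combined gradients
    theta = theta - lr * (m + b2 * v) / (np.sqrt(n) + eps)
    return theta, (m, v, n)
```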
Aug 21
"What Can Transformers Learn In-Context? A Case Study of Simple Function Classes"

Can models learn new, non-trivial functions...with no parameter changes? Turns out the answer is yes, with in-context learning: [1/11]
In-context learning is when you include some examples as text in the prompt at test time. Here's a great illustration from @sewon__min et al. (arxiv.org/abs/2202.12837). [2/11]
What's new in this paper is that they systematically assess how well in-context learning works for various well-defined function classes. [3/11]
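A toy version of the setup, with an illustrative prompt format: sample a random linear function, show the model (x, f(x)) pairs in-context, then ask for f on a fresh input — no weights change.

```python
import random

# A random linear function the model has never seen during training
w = [round(random.uniform(-1, 1), 2) for _ in range(3)]
f = lambda x: round(sum(wi * xi for wi, xi in zip(w, x)), 2)

lines = []
for _ in range(5):  # in-context examples: no parameter updates involved
    x = [round(random.uniform(-1, 1), 2) for _ in range(3)]
    lines.append(f"x={x} -> y={f(x)}")
lines.append("x=[0.1, -0.4, 0.7] -> y=")  # the query the model completes
prompt = "\n".join(lines)
print(prompt)
```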
Aug 13
"Language Models Can Teach Themselves to Program Better"

This paper changed my thinking about what future language models will be good at, mostly in a really concerning way. Let's start with some context: [1/11]
To teach models to program, you used to give them a natural language prompt. But recent work has shown that you can instead just show them a unit test and tell them to… [2/11]
…generate a program that satisfies it (a “programming puzzle”). This is way nicer because it’s simpler and you can just run the code to see if it works. [3/11]
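A toy illustration of the idea (the checker below is mine, not from the paper): the spec is just a function that returns True on a correct answer, so verifying a generated program means running it.

```python
def sat(fib):
    """The 'puzzle': any program whose output passes this check wins."""
    return fib(10) == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# A candidate program as a model might generate it:
def candidate(n):
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

print(sat(candidate))  # True -> accept; no human grading needed
```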
Aug 6
"An Impartial Take to the CNN vs Transformer Robustness Contest"

Are vision transformers really better than CNNs? This paper strongly suggests an answer, based on a robustness throwdown between {ViT, Swin} vs {BiT, ConvNeXt}. [1/10]
First, they measure the learning of spurious features using datasets designed to assess simplicity bias, background bias, and texture bias. The transformers and the CNNs behave similarly. [2/10]
For OOD detection, transformers and CNNs again work equally well. [3/10]
Aug 4
"Discrete Key-Value Bottleneck"

An intuitive method for making models robust to distribution shift. They replace vectors in the latent space with their nearest centroids, with the clustering… [1/8]
…and quantization applied separately to different slices of the feature space. The centroids are learned using a moving average process similar to minibatch k-means. [2/8]
The intuition here is that, when adapting to different input distributions, only certain combinations of codes will come up, so the codes corresponding to other input distributions will be unaffected. [3/8]
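A sketch of the quantization step (my simplification — the EMA centroid learning is omitted): split each feature vector into slices and snap every slice to its nearest code.

```python
import torch

def discrete_bottleneck(z, codebooks):
    """z: (batch, n_slices, slice_dim); codebooks: (n_slices, n_codes, slice_dim)."""
    out = []
    for i in range(z.shape[1]):
        dists = torch.cdist(z[:, i], codebooks[i])  # (batch, n_codes)
        nearest = dists.argmin(dim=1)               # index of closest centroid
        out.append(codebooks[i][nearest])           # replace slice with its code
    return torch.stack(out, dim=1)

z = torch.randn(4, 8, 16)            # features split into 8 slices of 16 dims
codebooks = torch.randn(8, 32, 16)   # 32 learned codes per slice
z_quantized = discrete_bottleneck(z, codebooks)
```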