Davis Blalock

@davisblalock

Aug 27, 2022 • 15 tweets • 6 min read

"Understanding Scaling Laws for Recommendation Models"

For two years, the AI world has had this glorious period of believing that big tech companies just need more compute to make their models better, not more user data.

That period is ending. Here's what happened: [1/14]

In 2020, OpenAI published a paper (arxiv.org/abs/2001.08361) assessing the relative effects of scaling up models vs datasets. They found that scaling up models had *way* higher returns. [2/14]

The party was on. We got libraries like DeepSpeed (github.com/microsoft/Deep…) that let you train huge models across countless GPUs. We got trillion-parameter… [3/14]

…Mixture of Experts models (arxiv.org/abs/2101.03961). We talked about the "infinite data regime" because we weren't even bothering to use all the data.

Parameter counts were the headline and sample counts were buried in the results section. [4/14]

Fast-forward to March 2022. DeepMind releases the Chinchilla paper (arxiv.org/abs/2203.15556), which shows that a subtle issue with the OpenAI paper caused it to vastly underestimate the importance of dataset size. [5/14]

With smaller models and more data, the Chinchilla authors got much better results for a fixed compute budget. [6/14]

Moreover, as one well-known commentary pointed out (lesswrong.com/posts/6Fpvch8R…), the Chinchilla scaling formula suggests that there's a *hard limit* for model accuracy that no amount of model size will ever overcome without more data. [7/14]
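To make that "hard limit" concrete, here's a minimal sketch. It assumes the parametric loss form fitted in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, with N the parameter count and D the number of training tokens. The constants are approximate fits as commonly quoted from that paper, so treat them as illustrative assumptions; the function names are made up for this example.

```python
# Assumed constants: approximate fits commonly quoted from the Chinchilla paper.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss: irreducible term + model-size term + data-size term."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def data_limited_floor(n_tokens: float) -> float:
    """Loss in the limit of infinite parameters for a fixed amount of data."""
    return E + B / n_tokens**BETA


if __name__ == "__main__":
    tokens = 3e11  # hypothetical fixed dataset size
    for n_params in (1e9, 1e10, 1e11, 1e12, 1e13):
        print(f"N={n_params:.0e}  loss={chinchilla_loss(n_params, tokens):.4f}")
    print(f"floor with D={tokens:.0e}: {data_limited_floor(tokens):.4f}")
```

As N grows, the A/N^alpha term shrinks toward zero, so under this formula the loss can only approach E + B/D^beta. The only way to lower that floor is to raise D.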

But all of the above work focused on language models.

The real moneymakers at big tech companies are recommender systems. These are what populate your feeds, choose what ads you see, etc.

Maybe language models need more data, but recommender systems don't? [8/14]

This brings us to the current paper, which studies scaling of recommender models.

Put simply: “We show that parameter scaling is out of steam...and until a higher-performing model architecture emerges, data scaling is the path forward.” [9/14]

In more detail, they conduct a thorough study of click-through rate prediction models, the workhorse of targeted ads.

They really seem to have tried to get model scaling to work, more so than any similar paper I've seen. [10/14]

E.g., they divide models into four components and dig deep into how to scale up each one as a function of model and dataset size.

But even the best-chosen model scaling isn't as good as data scaling. [11/14]

Also, similar to language model work, they find clear power laws. These mean that you need a *multiplicative* increase in data and compute to eliminate a fixed fraction of the errors.

I.e., the need for data + compute is nearly insatiable. [12/14]
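A quick sketch of why a power law implies this. It assumes error follows error(x) = c * x^(-alpha), where x is the amount of data or compute. The exponent 0.1 below is a hypothetical value chosen for illustration, not a number from the paper, and the function names are invented for this example.

```python
def power_law_error(x: float, c: float = 1.0, alpha: float = 0.1) -> float:
    """Error under an assumed power law: c * x^(-alpha)."""
    return c * x ** (-alpha)


def scale_factor_to_remove(fraction: float, alpha: float) -> float:
    """Factor by which x must grow to eliminate `fraction` of the current error.

    Solving c*(k*x)^(-alpha) = (1 - fraction) * c*x^(-alpha) for k gives
    k = (1 / (1 - fraction)) ** (1 / alpha), independent of x and c.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return (1.0 / (1.0 - fraction)) ** (1.0 / alpha)


if __name__ == "__main__":
    alpha = 0.1  # hypothetical exponent
    k = scale_factor_to_remove(0.5, alpha)
    print(f"halving the error takes {k:.0f}x more data/compute")
    x = 1e6
    print(power_law_error(k * x, alpha=alpha) / power_law_error(x, alpha=alpha))
```

The factor doesn't depend on where you start: every further halving of the error costs the same multiplier again, which is what makes the appetite for data and compute so hard to satisfy.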

I wish they *hadn't* found that recommender models need way more data—especially since recommender data is all about tracking what users see and click on.

But if that's the reality, I'm glad this information is at least shared openly via a well-executed paper. [13/14]

Speaking of which, here's the paper: bit.ly/3ARAjKI

And here's my more detailed synopsis: (dblalock.substack.com/i/69736655/und…) [14/14]

If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @newsha_a @CarolejeanWu @b_bhushanam.

For more threads like this, follow me or @MosaicML

As always, comments + corrections welcome!

https://twitter.com/davisblalock/status/1563455844670246912
