That new LFM2.5-350M is super overtrained, right? And everyone was shocked about how far they pushed it?
As it turns out, we have a brand new scaling law for that! 🧵
[1/n]
Introducing Train-to-Test (T²) scaling! We found that once you account for test-time scaling via repeated sampling, radical overtraining like this (training a smaller model for way longer) is actually compute optimal! 🙀
We consider the combined effect of (Chinchilla-style) pretraining scaling AND inference scaling via repeated sampling. When we compare models at a matched inference budget (smaller models get more samples), WAY overtraining becomes compute optimal.
[3/n]
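Back-of-the-envelope version of the matched-inference-budget setup (a sketch using the standard ~2·N FLOPs-per-token forward-pass estimate; the budget and token counts are illustrative, not numbers from the paper):

```python
# Sketch: at a fixed inference budget, smaller models afford more repeated samples.
# Uses the common ~2*N FLOPs/token forward-pass estimate; all numbers illustrative.

def samples_per_budget(n_params: float, tokens_per_sample: float, budget_flops: float) -> int:
    """How many repeated samples a model with n_params can draw within budget_flops."""
    flops_per_sample = 2 * n_params * tokens_per_sample
    return int(budget_flops // flops_per_sample)

budget = 1e15  # fixed inference FLOPs, shared across model sizes
for n in [350e6, 1e9, 7e9]:
    k = samples_per_budget(n, tokens_per_sample=1_000, budget_flops=budget)
    print(f"{n/1e6:>6.0f}M params -> k = {k} samples")
```

The point: under a shared inference budget, the 350M model gets ~20× the samples of the 7B model, which is what lets repeated sampling favor small, heavily trained models.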
With T², we take two different scaling approaches and get similar results from both! The first models the task NLL, and the second models pass@k accuracy on your task.
[4/n]
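For the pass@k side: repeated sampling is typically scored with the standard unbiased pass@k estimator (introduced with HumanEval). A generic sketch of that estimator, not code from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples passes,
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples, 10 correct:
print(pass_at_k(100, 10, 1))   # ~0.10
print(pass_at_k(100, 10, 10))  # much higher: repeated sampling helps
```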
Even though the two T² approaches are completely different, both recommend WAY OVERTRAINING your models! Here's what they look like compared to regular Chinchilla scaling 🤯🤯🤯
[5/n]
But nobody really does repeated sampling on base models, so we also looked into how this interacts with post-training
The T² overtraining forecasts survive post-training, though in some cases the effect is slowed down a bit. This is consistent with prior findings (cc @jacspringer)
The obvious next question is whether this holds up at scale. Well, the day we finished the paper, @liquidai released LFM2.5-350M, which ended up being really amazing validation because our scaling law *nearly predicts it*
Chinchilla alone would call a model this overtrained suboptimal, but our T² scaling says that test-time scaling can bring it back! Long live the T²! T² contextualizes how Chinchilla is still useful in the modern overtraining regime (smaller models, more tokens, more test-time scaling!)
This paper was a blast to work on. Shoutout to my amazing coauthors in @sprocket_lab and @HazyResearch, @zihengh1 @GOrlanski @atrost3122 @SungjunCh0 @Zhiqi_Gao_2001 @albertwu7716 @ekellbuch @awsTO @fredsala
(First some context) Scaling laws can tell you how to use your compute budget.
In compute-optimal scaling, you are given a compute budget and need to decide how to use it, often by balancing model size against the amount of training data.
[2/n]
The best choice of model size and token count is called the "compute optimum," which is usually selected using a validation set...
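Concretely, using the common C ≈ 6·N·D training-FLOPs approximation and the Chinchilla ~20 tokens-per-parameter rule of thumb (a sketch with textbook constants, not the paper's fitted law):

```python
def chinchilla_optimum(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a training budget C ~= 6*N*D, with D = tokens_per_param * N.
    Solving 6 * tokens_per_param * N^2 = C gives the model size N."""
    n_params = (budget_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla-style optimum vs. an "overtrained" split of the same budget:
n, d = chinchilla_optimum(1e21)
n_ot, d_ot = chinchilla_optimum(1e21, tokens_per_param=200.0)  # 10x more tokens/param
print(f"Chinchilla:  {n/1e9:.2f}B params, {d/1e12:.2f}T tokens")
print(f"overtrained: {n_ot/1e9:.2f}B params, {d_ot/1e12:.2f}T tokens")
```

Raising tokens-per-param shrinks the model and stretches the data at fixed training compute, which is exactly the overtrained corner of the design space T² argues for.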
What if we made decisions about this tradeoff by measuring specific skills of our models, rather than average performance?