Nicholas Roberts
Apr 6 · 11 tweets · 4 min read
That new LFM2.5-350M is super overtrained, right? And everyone was shocked by how far they pushed it?
As it turns out, we have a brand new scaling law for that! 🧵

[1/n]
Introducing Train-to-Test (T²) scaling! We found that test-time scaling via repeated sampling means that radical overtraining like this - training a smaller model for way longer - is actually compute optimal! 🙀



[2/n] arxiv.org/abs/2604.01411
We consider the combined effect of (Chinchilla-style) pretraining scaling AND inference scaling via repeated sampling. When we test-time scale Chinchilla to a matched inference budget (smaller models get more inference), WAY overtraining becomes compute optimal.

[3/n]
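To make the "matched inference budget" idea concrete, here is a rough sketch of the compute bookkeeping, using the standard approximations of ~6ND FLOPs for training and ~2N FLOPs per generated token for inference. All model sizes, token counts, and sample counts below are illustrative assumptions, not numbers from the paper.

```python
# Rough compute bookkeeping for "pretraining + repeated-sampling inference",
# using the standard approximations: training ~ 6*N*D FLOPs, inference ~ 2*N
# FLOPs per generated token. All concrete numbers are illustrative assumptions.

def training_flops(n_params: float, n_train_tokens: float) -> float:
    return 6.0 * n_params * n_train_tokens

def inference_flops(n_params: float, tokens_per_sample: float,
                    samples_per_query: float, n_queries: float) -> float:
    return 2.0 * n_params * tokens_per_sample * samples_per_query * n_queries

def total_flops(n_params, n_train_tokens, tokens_per_sample,
                samples_per_query, n_queries) -> float:
    return (training_flops(n_params, n_train_tokens)
            + inference_flops(n_params, tokens_per_sample,
                              samples_per_query, n_queries))

# Same total budget spent two ways: a 1B model at ~20 tokens/param with k=4
# samples per query, vs. a 350M model trained on ~163 tokens/param and given
# proportionally more samples (k ~= 11.4) so the inference budgets match.
big   = total_flops(1e9,   20e9, tokens_per_sample=512,
                    samples_per_query=4,               n_queries=1e6)
small = total_flops(350e6, 57e9, tokens_per_sample=512,
                    samples_per_query=4 * 1e9 / 350e6, n_queries=1e6)
print(f"1B model,   20 tok/param, k=4:    {big:.3e} FLOPs")
print(f"350M model, ~163 tok/param, k~11: {small:.3e} FLOPs")
```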
With T², we take two different scaling approaches and get similar results with both of them! The first models the task NLL, and the second models the pass@k accuracy for your task.

[4/n]
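For anyone who hasn't seen pass@k before: it is the probability that at least one of k sampled attempts solves the task, which is exactly what repeated sampling buys you. Here is a quick sketch of the standard formulas (the textbook version plus the usual unbiased estimator), not necessarily the exact parameterization the paper fits.

```python
import math

def pass_at_k_independent(p: float, k: int) -> float:
    """Pass@k under the idealized assumption of k independent samples,
    each solving the task with probability p."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator from n drawn samples,
    c of which are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A weak per-sample success rate still turns into a high pass@k with enough
# samples, which is why more inference can substitute for a bigger model.
print(pass_at_k_independent(0.05, k=64))        # ~0.96
print(pass_at_k_unbiased(n=200, c=10, k=64))
```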
Even though the two T² approaches are completely different, both recommend WAY OVERTRAINING your models! Here's what they look like compared to regular Chinchilla scaling 🤯🤯🤯

[5/n]
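For reference, "regular Chinchilla scaling" here means splitting a fixed training budget between parameters and tokens at roughly 20 tokens per parameter. A tiny sketch of that baseline allocation follows; the overtrained ratios are chosen purely for illustration and are not T²'s fitted optima.

```python
import math

def allocate(train_flops: float, tokens_per_param: float):
    """Split a training budget (~6*N*D FLOPs) into (params N, tokens D)
    at a fixed tokens-per-parameter ratio."""
    n_params = math.sqrt(train_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

C = 1e21  # example training budget in FLOPs
for label, ratio in [("Chinchilla rule of thumb", 20),
                     ("overtrained", 200),
                     ("way overtrained", 2000)]:
    n_params, n_tokens = allocate(C, ratio)
    print(f"{label:25s} ~{ratio:>4d} tok/param: "
          f"N={n_params:.2e} params, D={n_tokens:.2e} tokens")
```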
But nobody really does repeated sampling on base models, so we also looked into how this interacts with post-training.

The T² overtraining forecasts survive post-training, in some cases slowed down a bit. This is consistent with prior findings (cc @jacspringer):

[6/n] arxiv.org/abs/2503.19206
The obvious next question is whether this holds up at scale. Well, the day we finished the paper, @liquidai released LFM2.5-350M, which ended up being really amazing validation, because our scaling law *nearly predicts it*.

[7/n]
There has also been a trend of using Chinchilla only for your largest models, and overtraining small ones for cheap inference.

Chinchilla was feared to be dead... If you want small models that are THIS overtrained, why use Chinchilla at all?

[8/n]
But our T² scaling says that test-time scaling can bring it back! Long live the T²! T² contextualizes how Chinchilla is still useful in the modern overtraining regime (smaller models, more tokens, more test-time scaling!)

[9/n]
This paper was a blast to work on. Shoutout to my amazing coauthors in @sprocket_lab and @HazyResearch, @zihengh1 @GOrlanski @atrost3122 @SungjunCh0 @Zhiqi_Gao_2001 @albertwu7716 @ekellbuch @awsTO @fredsala

[10/n] arxiv.org/abs/2604.01411
Also huge thank you to the folks who got me into scaling laws back when I interned at Meta! @_dieuwke_ @niladrichat @sharan0909 @ml_perception

[11/11]
