Ph.D. student @WisconsinCS. Working on foundation models and breaking past scaling laws. Previously CMU @mldcmu, UCSD @ucsd_cse, FCC @fresnocity. 🤔🤨🧐 e/hmm
Apr 6 • 11 tweets • 4 min read
That new LFM2.5-350M is super overtrained, right? And everyone was shocked by how far they pushed it?
As it turns out, we have a brand new scaling law for that! 🧵
[1/n]
Introducing Train-to-Test (T²) scaling! We found that test-time scaling via repeated sampling means that radical overtraining like this - training a smaller model for way longer - is actually compute optimal!
We find that knowledge and reasoning exhibit different scaling behaviors!
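Here's a rough sketch of the accounting behind that claim (my illustration, not the paper's code): it uses the standard approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token at inference, and the model sizes, sample counts, and query counts are made-up assumptions, not numbers from the paper.

```python
# A minimal sketch (not the paper's code) of train-vs-test compute accounting.
# Assumptions: ~6*N*D FLOPs for training, ~2*N FLOPs per generated token at
# inference; all sizes and counts below are illustrative.

def train_flops(n_params: float, n_tokens: float) -> float:
    # ~6 FLOPs per parameter per training token
    return 6.0 * n_params * n_tokens

def test_flops(n_params: float, gen_tokens: float, k_samples: int) -> float:
    # Repeated sampling: k forward passes, ~2 FLOPs per parameter per token
    return 2.0 * n_params * gen_tokens * k_samples

def total_flops(n_params, n_tokens, gen_tokens, k_samples, n_queries):
    # Train once, then answer n_queries prompts with best-of-k sampling
    return train_flops(n_params, n_tokens) + n_queries * test_flops(
        n_params, gen_tokens, k_samples
    )

# Same training budget two ways: a 1B model at ~20 tokens/param vs a
# 350M model trained on ~3x more tokens ("overtrained").
budget = train_flops(1e9, 20e9)
configs = {
    "1B, 20 tok/param": dict(n_params=1e9, n_tokens=20e9),
    "350M, overtrained": dict(n_params=350e6, n_tokens=budget / (6.0 * 350e6)),
}

for name, cfg in configs.items():
    c = total_flops(**cfg, gen_tokens=512, k_samples=16, n_queries=1e6)
    print(f"{name}: total FLOPs ~ {c:.3e}")
```

At the same training budget, the 350M model's forward passes are ~3x cheaper, so once you're paying for lots of best-of-k samples at test time, the overtrained small model comes out ahead on total FLOPs.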
Super excited to finally tell you all about our paper on the compute optimal scaling of skills:
arxiv.org/pdf/2503.10061
(First some context) Scaling laws can tell you how to use your compute budget.
In compute optimal scaling, you're given a compute budget and have to decide how to spend it, typically by trading off model size against the amount of training data.
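For intuition, here's a minimal sketch of that classic allocation, assuming the common Chinchilla-style rules of thumb (C ≈ 6·N·D train FLOPs and roughly 20 training tokens per parameter - constants from fitted scaling laws, not from our paper):

```python
import math

# A minimal sketch of the classic compute-optimal allocation, assuming the
# common Chinchilla-style rules of thumb: train FLOPs C ~ 6*N*D and a
# compute-optimal ratio of roughly 20 tokens per parameter.
# These constants are rules of thumb from fitted scaling laws.

def allocate(compute_budget_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C between model size N and training tokens D.

    With C = 6*N*D and D = r*N:  N = sqrt(C / (6*r)),  D = r*N.
    """
    n_params = math.sqrt(compute_budget_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n, d = allocate(1e21)  # e.g. a 10^21-FLOP budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~2.9e9 params, ~5.8e10 tokens
```

Under those constants, a 10^21 FLOP budget lands around 2.9B params and 58B tokens. Test-time scaling changes this calculus - which is what T² is about.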