Samip
solving generalization at https://t.co/zsptJBlblS
Jun 4, 8 tweets, 4 min read
1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?

Our first paper from Q! q0 trains a population of models, instead of a single model that saturates fast, reaching a dramatically lower loss at *every* epoch budget.

w/ @bishmdl76 @akshayvegesna @ShmuelBerman

2/ Paper: arxiv.org/abs/2606.03938

q0 is built on one intuition, motivated by Solomonoff induction: instead of training one perfect model, train a population of diverse models and aggregate predictions. Everything in the algorithm follows from this one goal of efficiently training a population. It comes down to three core primitives:
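To make the "train a population and aggregate predictions" intuition from 2/ concrete, here is a toy sketch. It is not the q0 algorithm and does not represent any of its three primitives. The uniform averaging, the three-token vocabulary, and the probability values are all assumptions made up for illustration.

```python
import math

def ensemble_probs(member_probs):
    """Uniformly average per-class probabilities across models.

    Uniform weights are an assumption for this sketch; a population
    could be weighted differently.
    """
    n_models = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_models for c in range(n_classes)]

def nll(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

# Hypothetical next-token distributions from three diverse models
# over a 3-token vocabulary (made-up numbers).
members = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.1, 0.4],
]
target = 0  # index of the true next token

mixture = ensemble_probs(members)
mean_member_loss = sum(nll(p, target) for p in members) / len(members)
ensemble_loss = nll(mixture, target)

print(f"mean single-model loss: {mean_member_loss:.4f}")
print(f"ensemble loss:          {ensemble_loss:.4f}")
```

Because -log is convex, the loss of the averaged prediction can never exceed the average of the individual losses (Jensen's inequality). The gap grows as the models disagree more, which is one reason diversity in the population matters.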
Mar 9, 5 tweets, 4 min read
1/ We released NanoGPT Slowrun 10 days ago. Already at 8x data efficiency and improving fast, so we're doubling down.

Announcing Slowrun Research and Slowrun Cluster: our open research effort to collaborate with researchers with crazy ideas, and a serious cluster to back it.

2/ Why? Compute scales. Data doesn't. Current scaling laws require both to grow proportionally, and that's a big problem. We need fundamentally new learning algorithms for the limited-data, practically-infinite-compute setting.

Slowrun is already surfacing new data-efficient methods, but we want to aim for at least 100x data efficiency this year and that will take a lot more exploration.