1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?
Our first paper from Q! q0 trains a population of models instead of a single model that saturates fast, reaching dramatically lower loss at *every* epoch budget.
q0 is built on one intuition, motivated by Solomonoff induction: instead of training one perfect model, train a population of diverse models and aggregate predictions. Everything in the algorithm follows from this one goal of efficiently training a population. It comes down to three core primitives:
arxiv.org/abs/2606.03938
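The "train a population, aggregate predictions" intuition can be illustrated with a toy sketch. This is not q0's actual algorithm: the thread does not say how predictions are aggregated, so uniform averaging in probability space is an assumption here, and the three "models" are hand-written, hypothetical next-token distributions. The only thing it shows is that, because negative log is convex, the loss of the averaged prediction is never worse than the average loss of the individual members.

```python
import math

def ensemble_probs(member_probs):
    """Uniformly average per-model distributions over the vocabulary.

    Assumption: uniform averaging in probability space. The thread does not
    confirm that q0 aggregates this way.
    """
    k = len(member_probs)
    vocab_size = len(member_probs[0])
    return [sum(p[v] for p in member_probs) / k for v in range(vocab_size)]

def nll(probs, target):
    """Negative log-likelihood of the target token under one distribution."""
    return -math.log(probs[target])

# Hypothetical next-token distributions over a 3-token vocabulary,
# standing in for three diverse trained models.
members = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.1, 0.4],
]
target = 0  # index of the true next token in this toy example

mixture = ensemble_probs(members)
ensemble_loss = nll(mixture, target)
mean_member_loss = sum(nll(p, target) for p in members) / len(members)

print(f"ensemble loss:    {ensemble_loss:.4f}")
print(f"mean member loss: {mean_member_loss:.4f}")
```

The gap between the two numbers grows with how much the members disagree, which is one reason diversity within the population matters.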
Mar 9 • 5 tweets • 4 min read
1/ We released NanoGPT Slowrun 10 days ago. Already at 8x data efficiency and improving fast, so we're doubling down.
Announcing Slowrun Research and Slowrun Cluster: our open research effort to collaborate with researchers with crazy ideas, and a serious cluster to back it.
2/ Why? Compute scales. Data doesn't. Current scaling laws require both to grow proportionally, and that's a big problem. We need fundamentally new learning algorithms for the limited-data, practically-infinite-compute setting.
Slowrun is already surfacing new data-efficient methods, but we want to aim for at least 100x data efficiency this year and that will take a lot more exploration.