Samip
Jun 4 · 8 tweets · 4 min read
1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?

Our first paper from Q! q0 trains a population of models instead of a single model that saturates fast, reaching a dramatically lower loss at *every* epoch budget.

w/ @bishmdl76 @akshayvegesna @ShmuelBerman
2/ Paper: arxiv.org/abs/2606.03938

q0 is built on one intuition, motivated by Solomonoff induction: instead of training one perfect model, train a population of diverse models and aggregate their predictions. Everything in the algorithm follows from this one goal of efficiently training a population. It comes down to three core primitives:
3/ Primitive 1: fast exploration of weight space. Training many models from scratch to build a population is too expensive. Inspired by FGE, we collect many models along a few parallel cyclic trajectories. The mechanism is anti-correlating weight decay with the LR, so each cycle explores early (high LR, low WD) then settles into a low-norm basin right before we snapshot.
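
A minimal sketch of what an anti-correlated schedule could look like. The cosine shape, the function names, and every hyperparameter value here are assumptions for illustration; the thread only says that weight decay moves opposite to the LR within each cycle and that snapshots are taken at the end of a cycle.

```python
import math

def cyclic_lr_wd(step, cycle_len, lr_max=3e-4, lr_min=3e-5, wd_min=0.0, wd_max=0.1):
    """Return (lr, wd) for a step. All default values are hypothetical."""
    # Position inside the current cycle: 0.0 at its first step, 1.0 at its last.
    t = (step % cycle_len) / max(cycle_len - 1, 1)
    cos = 0.5 * (1.0 + math.cos(math.pi * t))  # falls from 1 to 0 over the cycle
    lr = lr_min + (lr_max - lr_min) * cos       # high early, low late
    wd = wd_max - (wd_max - wd_min) * cos       # low early, high late
    return lr, wd

def is_snapshot_step(step, cycle_len):
    """True on the last step of a cycle, when LR is lowest and WD is highest."""
    return step % cycle_len == cycle_len - 1

if __name__ == "__main__":
    for step in range(0, 20, 3):
        lr, wd = cyclic_lr_wd(step, cycle_len=10)
        print(step, f"lr={lr:.2e}", f"wd={wd:.3f}", is_snapshot_step(step, 10))
```

In a real training loop the two values would be written into the optimizer's parameter groups at every step, and the model weights copied out whenever `is_snapshot_step` is true.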

Primitive 2: model capability compounding via chain distillation. Independently trained models all come out about equally good, so adding more doesn't lift quality. We train each model against its predecessor as a frozen teacher (KL on soft targets), so every model improves on the last and the population compounds.
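
The "KL on soft targets" term can be sketched as below, assuming `[batch, vocab]` logits from a student and a frozen teacher. The `temperature` parameter and the function names are assumptions, and whether q0 mixes this term with a hard-label loss is not stated in the thread.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_to_teacher(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over a batch; temperature is hypothetical."""
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    # Expectation under the teacher of the log-probability gap.
    return float((np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8))
    student = rng.normal(size=(4, 8))
    print(kl_to_teacher(student, teacher))
```

In a chain, model i would be trained with model i-1 as the teacher, and the finished model i would then be frozen and serve as the teacher for model i+1.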

Primitive 3: a learned generalization prior. Uniform averaging wastes the good members. We fit one softmax weighting over models on a held-out set by minimizing ensemble loss, then reuse it to pick and weight the best K models for any inference budget.
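
A rough sketch of fitting one softmax weighting on held-out data. It assumes the ensemble averages member probabilities (rather than logits) and uses plain gradient descent; both choices, along with the names `fit_ensemble_weights` and `top_k`, are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def softmax(theta):
    w = np.exp(theta - theta.max())
    return w / w.sum()

def ensemble_loss(w, p_correct):
    """Mean negative log-likelihood of the weighted mixture.
    p_correct[m, n] is the probability model m gives the true token of example n."""
    return float(-np.log(w @ p_correct).mean())

def fit_ensemble_weights(p_correct, steps=500, lr=0.5):
    """Fit softmax weights over models by gradient descent on held-out loss."""
    theta = np.zeros(p_correct.shape[0])         # start from uniform weights
    for _ in range(steps):
        w = softmax(theta)
        mix = w @ p_correct                      # mixture probability per example
        g = -(p_correct / mix).mean(axis=1)      # d loss / d w
        theta -= lr * w * (g - w @ g)            # chain rule through the softmax
    return softmax(theta)

def top_k(w, k):
    """Keep the k highest-weighted models and renormalize their weights."""
    idx = np.argsort(w)[::-1][:k]
    return idx, w[idx] / w[idx].sum()

if __name__ == "__main__":
    p = np.array([[0.6, 0.6, 0.6, 0.6],
                  [0.3, 0.4, 0.2, 0.3],
                  [0.3, 0.2, 0.4, 0.3]])
    w = fit_ensemble_weights(p)
    print(w, ensemble_loss(w, p), top_k(w, 2))
```

Because the weights are fitted once, the same vector can be truncated to any K with `top_k`, which is one way to read "reuse it to pick and weight the best K models for any inference budget".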
4/ Now the results at 256 epochs. A single model saturates after ~16 epochs. Our strong ensembling baseline pushes past that but converges slowly.

q0 matches the ensemble baseline at only ~56 epochs of training, 4.6x fewer, and keeps improving through 256 to a val loss of 3.003 vs the baseline's 3.048. The gains translate to downstream benchmarks too.
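
As a quick sanity check on these numbers: the sketch below assumes the "4.6x fewer" figure compares ~56 epochs against the baseline's full 256, and that the validation loss is a per-token cross-entropy in nats. Neither assumption is spelled out in the thread.

```python
import math

baseline_epochs = 256   # assumed reference point for the "4.6x" claim
q0_epochs = 56          # epochs at which q0 matches the baseline
speedup = baseline_epochs / q0_epochs

q0_loss, baseline_loss = 3.003, 3.048
gap = baseline_loss - q0_loss

# If the loss is in nats, a loss gap maps to a perplexity ratio of exp(-gap).
ppl_ratio = math.exp(-gap)

print(f"speedup ~ {speedup:.2f}x")
print(f"loss gap = {gap:.3f}")
print(f"q0 perplexity / baseline perplexity ~ {ppl_ratio:.3f}")
```

Under those assumptions the ratio 256/56 rounds to 4.6, and a 0.045 loss gap corresponds to roughly 4 to 5 percent lower perplexity.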
5/ Importantly, these gains hold at every epoch budget, from one to hundreds. But the optimal allocation shifts with scale. A budget splits across three knobs: parallel base models, cycles per model, and cycle length.

Small budgets want a single base model, with frequent cycles packed toward the end of training. As the budget grows, adding parallel base models starts to pay: roughly one more base each time you double the epochs (one base up to ~128 epochs, two to ~256, three to ~512).
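
The doubling rule of thumb can be written as a tiny helper. This is only a literal reading of the three approximate thresholds quoted above; the function name is hypothetical, and extrapolating past ~512 epochs is an assumption the thread does not make.

```python
def num_base_models(epoch_budget):
    """Rough count of parallel base models for a given epoch budget.
    One base up to ~128 epochs, then one more per doubling of the budget."""
    n, cap = 1, 128
    while epoch_budget > cap:
        n += 1
        cap *= 2
    return n

if __name__ == "__main__":
    for budget in (1, 16, 128, 256, 512):
        print(budget, num_base_models(budget))
```

The other two knobs, cycles per model and cycle length, are not covered here because the thread gives no numeric rule for them.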
6/ I'm confident this beats standard pretraining at any budget, even a single epoch, but the biggest limitation is inference cost. An ensemble of K models means K forward passes. It's effectively a way of growing the combined model's parameter count, like scaling depth but without the saturation depth scaling faces.

As with any large model, the fix is distillation into a single model, which tends to work magically well, but we leave that to future work.
7/ Looking beyond this paper: scaling compute against a fixed, limited pool of data will need new primitives. Searching over a population of models is a different problem than standard gradient descent training and we've barely scratched the surface. We hope q0 pushes people toward crazy ideas in multi-epoch training and scaling compute in general!!
8/ Huge thanks to Andrew Gordon Wilson (@andrewgwils) for feedback on the paper!

Code at Slowrun: github.com/qlabs-eng/slow…

More from @industriaalist

Mar 9
1/ We released NanoGPT Slowrun 10 days ago. Already at 8x data efficiency and improving fast, so we're doubling down.

Announcing Slowrun Research and Slowrun Cluster: our open research effort to collaborate with researchers with crazy ideas, and a serious cluster to back it.
2/ Why? Compute scales. Data doesn't. Current scaling laws require both to grow proportionally, and that's a big problem. We need fundamentally new learning algorithms for the limited-data, practically-infinite-compute setting.

Slowrun is already surfacing new data-efficient methods, but we want to aim for at least 100x data efficiency this year and that will take a lot more exploration.
3/ Some research directions we think are interesting:

a. Replacing gradient descent: SGD was designed for limited compute, unlimited data - precisely the opposite of where we're headed. Alternatives that leverage more compute for broader exploration of the loss landscape become viable in the Slowrun regime. Evolutionary algorithms have been scaled efficiently to billions of parameters and can already surpass backprop, esp in rough optimization landscapes where gradients are noisy. arxiv.org/abs/2511.16652, arxiv.org/abs/2509.24372

b. Diffusion Models: DLMs seem significantly more data-efficient than AR. There are a few reasons to believe this: they use more FLOPs at train and test time via iterative denoising, and they get built-in data augmentation from different corruption patterns per sequence. But they haven't been stress-tested against the kind of improvements Slowrun is finding for AR models. arxiv.org/abs/2511.03276, arxiv.org/pdf/2507.15857