Pratyush Maini
Aug 18, 2025 · 15 tweets · 6 min read
1/ Pretraining is hitting a data wall: scaling raw web data alone yields diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach and all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models 🚀
- A new Pareto frontier for performance
2/ Synthetic data has been all the rage, with frontier models (Qwen3, Kimi K2, GPT-5) training on large amounts of it. But there is little public science on how to do it well. We've been working on this for 2+ years and are excited to share BeyondWeb.
Blog: blog.datologyai.com/beyondweb
Arxiv: arxiv.org/abs/2508.10975
3/ First, the results 📊 BeyondWeb outperforms strong baselines such as Cosmopedia and WRAP by more than 5.1%, and Nemotron-Synth by 2.6%, all while cutting training time by more than half. In fact, our 3B LLMs outperform most 8B baselines!
4/ Key insight: no single magic bullet solves synthetic data; naive approaches yield little benefit at huge cost. But thoughtful design, like targeted rephrasing of web docs, can exceed natural data. Let's go deep into each of our learnings, posed as Research Questions (RQs).
5/ RQ #1: Is synthetic data just knowledge distillation? Generator-driven synthetic data (e.g., Cosmopedia) can be approximated by simple summarization: we roughly match its performance with basic prompts that boost per-token information density. BeyondWeb outperforms both via intentional strategies (sketch below).
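To make the rephrasing idea concrete, here is a minimal sketch of prompt-based rephrasing in the spirit of WRAP/BeyondWeb. The prompt wording and model choice are illustrative assumptions on my part, not the paper's actual recipe.

```python
# Minimal sketch of rephrasing-based synthetic data generation.
# Prompt text and model name are assumptions, not BeyondWeb's recipe.
from transformers import pipeline

# Per RQ #7 below, even a small 1B-3B instruct model is a reasonable
# rephraser; this particular checkpoint is just an example.
rephraser = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

REPHRASE_PROMPT = (
    "Rewrite the following web text as a clear, information-dense "
    "passage, keeping every fact intact:\n\n{doc}\n\nRewrite:"
)

def rephrase(doc: str, max_new_tokens: int = 512) -> str:
    """Produce one synthetic rephrasing of a web document."""
    out = rephraser(
        REPHRASE_PROMPT.format(doc=doc),
        max_new_tokens=max_new_tokens,
        do_sample=True,      # sampling adds surface-form diversity
        temperature=0.7,
        return_full_text=False,
    )
    return out[0]["generated_text"].strip()

if __name__ == "__main__":
    web_doc = "The Eiffel Tower, completed in 1889, is 330 m tall..."
    print(rephrase(web_doc))
```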
6/ RQ #2: Can any synthetic data break the data wall? No: naive strategies such as simple continuations barely beat repeating data. The real gains come from thoughtful synthetic data that fills distributional gaps. BeyondWeb breaches the wall where others can't.
7/ RQ #3: High-quality seed data matters, a lot. Given limited high-quality data, it is better to use it as seed data for synthetic rephrasing than to rephrase low-quality data, despite the risk of repeating knowledge from the high-quality sources. A selection sketch follows.
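A sketch of the RQ #3 finding: spend the rephrasing budget on the highest-quality seeds. The `quality_score` heuristic below is a placeholder I made up; a real pipeline would use a trained document-quality classifier.

```python
# Select high-quality seed documents for rephrasing (RQ #3).
# quality_score is a placeholder heuristic, NOT the paper's filter.
def quality_score(doc: str) -> float:
    # Crude stand-in: favor longer, punctuation-rich prose.
    words = doc.split()
    if not words:
        return 0.0
    return min(len(words), 500) / 500 + doc.count(".") / len(words)

def select_seed_docs(corpus: list[str], budget: int) -> list[str]:
    """Keep the top-`budget` docs by quality as rephrasing seeds."""
    return sorted(corpus, key=quality_score, reverse=True)[:budget]
```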
8/ RQ #4: Style matching helps. Web data is only ~2.7% conversational, yet chat is one of LLMs' main uses. Upsampling conversational styles improves downstream tasks, though the gains saturate quickly. BeyondWeb aligns training styles with inference styles for better generalization (arithmetic sketch below).
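Back-of-the-envelope arithmetic for RQ #4: how many documents would need a conversational rewrite to lift the chat share of the mix from ~2.7% to some target. The target fraction here is an assumption; the thread only says gains saturate quickly.

```python
# How many non-chat docs to rewrite as chat to reach a target fraction.
# Rewriting k of n docs gives a new chat share of (current*n + k) / n,
# so k = n * (target - current).
def docs_to_convert(n_docs: int, current_frac: float, target_frac: float) -> int:
    assert 0 <= current_frac <= target_frac < 1
    return round(n_docs * (target_frac - current_frac))

# e.g., a 1M-doc corpus at 2.7% conversational, targeting 15% (assumed):
print(docs_to_convert(1_000_000, 0.027, 0.15))  # -> 123000
```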
9/ RQ #5: Diversity sustains gains at scale. Gains from synthetic data plateau early if the strategy is static (e.g., a fixed textbook style). Diversity-focused approaches like BeyondWeb keep improving over trillions of tokens, which is critical for long-horizon training. See the style-rotation sketch below.
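One simple way to avoid a single static style, sketched below: cycle each document through a pool of rephrasing styles. The style list and prompt wording are illustrative assumptions, not BeyondWeb's actual style taxonomy.

```python
# Rotate documents through a pool of rephrasing styles (RQ #5 sketch).
# Styles and prompt wording are assumptions for illustration.
import itertools

STYLES = [
    "a concise encyclopedia entry",
    "a question-and-answer dialogue between a student and a teacher",
    "a step-by-step tutorial",
    "a plain-language summary for a general audience",
]

def diverse_prompts(docs: list[str]):
    """Yield (doc, prompt) pairs, cycling through the style pool."""
    for doc, style in zip(docs, itertools.cycle(STYLES)):
        yield doc, (
            f"Rewrite the following text as {style}, "
            f"preserving all facts:\n\n{doc}\n\nRewrite:"
        )
```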
10/ RQ #6: Most rephrasers are fine. (1) The benefits of synthetic data are largely robust across rephraser model families. (2) Generator model quality is not predictive of synthetic data quality, and can sometimes even be anti-correlated with it (think benchmaxxing).
11/ RQ #7: Small LLMs can be effective rephrasers. Synthetic data quality improves as the rephraser grows from 1B to 3B, then saturates around 8B. The simplicity of rephrasing makes generator size less critical, enabling highly scalable synthetic pretraining.
12/ Takeaway: there is no single magic wand for synthetic data; quality, style, diversity, and more must be optimized jointly. BeyondWeb sets a new Pareto frontier: models train 7.7x faster than on raw web data and 2.7x faster than on Nemotron-Synth.
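One way to read those headline speedups, as token budgets: if a baseline needs T tokens to reach a given accuracy, BeyondWeb reaches it with roughly T/7.7 (vs. raw web) or T/2.7 (vs. Nemotron-Synth). The 1T-token budget below is an assumed example, purely illustrative.

```python
# Token budgets implied by the headline speedups (illustrative only).
T = 1_000_000_000_000  # assumed 1T-token baseline budget
print(f"vs raw web (7.7x):        {T / 7.7:.3e} tokens")
print(f"vs Nemotron-Synth (2.7x): {T / 2.7:.3e} tokens")
```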
13/ Implications: @datologyai is here to democratize high-quality synthetic data for all. We show a pathway to generating synthetic data cheaply at scale. We helped curate 7T tokens for @arcee_ai 's AFM4.5B, which already shows these wins in the real world!
14/ Dive into the full post for recipes, benchmarks, and other cool experiments. Thanks to the @datologyai team, especially @leavittron and @VineethDorna, whose massive contributions helped shape this work.
Arxiv: arxiv.org/abs/2508.10975
Blog: blog.datologyai.com/beyondweb
15/ Needless to say, such a massive undertaking could not have been accomplished without a stellar engineering team that helped us scale this work to trillions of tokens. If you are excited about this, join us: jobs.ashbyhq.com/DatologyAI

