Pratyush Maini
Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi 🦋: https://t.co/LWId4rfbvQ
Aug 18, 2025 15 tweets 6 min read
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance

2/Synthetic data has been all the rage, with frontier models (Qwen3, KimiK2, GPT5) all using large amounts of synthetic data. But there is little public science behind it. We've been working on this for 2+ years, and we are excited to share BeyondWeb.
Blog: blog.datologyai.com/beyondweb
Arxiv: arxiv.org/abs/2508.10975
Jun 12, 2024 13 tweets 4 min read
1/We've nailed a framework to reliably detect if an LLM was trained on your dataset: LLM Dataset Inference.

After more than a year of wrestling with how hard this problem is, we had a breakthrough that quite literally made me jump out of my seat!

📝: Long🧵 arxiv.org/abs/2406.06443
2/Let's first understand why this is hard: LLMs are trained on trillions of tokens, usually for just one epoch, so any given data point is likely seen only once. Models no longer overfit to their training set. This makes the long-studied problem of "membership inference" hard.
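Because one-epoch training leaves only a weak signal on any single example, the natural move is to aggregate many weak per-example signals into one dataset-level decision. A minimal sketch of that idea (not the paper's exact method; the loss values, sample sizes, and threshold below are all hypothetical) is a two-sample test comparing losses on a suspect split against a held-out split:

```python
import math
import random
import statistics

def welch_t_stat(a, b):
    # Welch's t statistic comparing the means of two samples.
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def dataset_inference(suspect_losses, heldout_losses, threshold=-2.0):
    # If the suspect split's per-example losses are significantly LOWER
    # than the held-out split's, flag the dataset as likely trained-on.
    # Each individual loss is a weak signal; the test aggregates them.
    t = welch_t_stat(suspect_losses, heldout_losses)
    return t < threshold, t

# Toy demo with simulated per-example losses (hypothetical numbers):
# trained-on examples get a slightly lower mean loss than unseen ones.
random.seed(0)
trained = [random.gauss(2.8, 0.4) for _ in range(500)]
unseen = [random.gauss(3.0, 0.4) for _ in range(500)]
flag, t = dataset_inference(trained, unseen)
```

The key design point is that the decision is made about a *dataset*, not a single example: with hundreds of examples per split, even a small mean-loss gap that would be invisible per-example becomes statistically detectable.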