Pratyush Maini
Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi 🦋: https://t.co/LWId4rfbvQ
Aug 18, 2025 15 tweets 6 min read
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance

2/Synthetic data has been all the rage, with frontier models (Qwen3, KimiK2, GPT5) all using large amounts of synthetic data. But there is little public science behind it. We've been working on this for 2+ years, and we are excited to share BeyondWeb.
Blog: blog.datologyai.com/beyondweb
Arxiv: arxiv.org/abs/2508.10975
Jun 12, 2024 13 tweets 4 min read
1/We've nailed a framework to reliably detect if an LLM was trained on your dataset: LLM Dataset Inference.

After more than a year of wrestling with how hard this problem is, we had a breakthrough that quite literally made me jump out of my seat!

📝: Long🧵 arxiv.org/abs/2406.06443
2/Let's first understand why this is hard: LLMs are trained on trillions of tokens, usually for just one epoch, so any given data point is likely seen only once. Models no longer overfit to their training set. This makes the long-studied problem of "membership inference" hard.
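Because one-epoch training leaves only a weak signal on any single example, the natural move is to aggregate many weak per-example signals into one dataset-level decision. A minimal sketch of that idea (not the paper's exact method; the loss values, sample sizes, and threshold below are all hypothetical) is a two-sample test comparing losses on a suspect split against a held-out split:

```python
import math
import random
import statistics

def welch_t_stat(a, b):
    # Welch's t statistic comparing the means of two samples.
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def dataset_inference(suspect_losses, heldout_losses, threshold=-2.0):
    # If the suspect split's per-example losses are significantly LOWER
    # than the held-out split's, flag the dataset as likely trained-on.
    # Each individual loss is a weak signal; the test aggregates them.
    t = welch_t_stat(suspect_losses, heldout_losses)
    return t < threshold, t

# Toy demo with simulated per-example losses (hypothetical numbers):
# trained-on examples get a slightly lower mean loss than unseen ones.
random.seed(0)
trained = [random.gauss(2.8, 0.4) for _ in range(500)]
unseen = [random.gauss(3.0, 0.4) for _ in range(500)]
flag, t = dataset_inference(trained, unseen)
```

The key design point is that the decision is made about a *dataset*, not a single example: with hundreds of examples per split, even a small mean-loss gap that would be invisible per-example becomes statistically detectable.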