We pre-trained diffusion language models (DLMs) and autoregressive (AR) models from scratch, at up to 8B params, 480B tokens, and 480 epochs.
Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks.
> No saturation: more repeats = more gains.
🚨 ”x.openreview.net”
We also dissect serious methodological flaws in the concurrent work "Diffusion Beats Autoregressive in Data-Constrained Settings". Let's raise the bar for open review!
Diffusion language models are super data learners. 🧬
We pre-trained a series of DLMs from scratch for up to 8B parameters and 480B tokens.
The study provides compelling evidence that, by repeating normal web data, DLMs outperform their AR counterparts across model sizes in data-constrained settings, demonstrating significantly greater potential without hitting performance saturation.
Overall, our results suggest DLMs exhibit more than threefold greater ultimate data potential compared to AR models.
Witness the intelligence crossovers 👇
🧵 2/18
Repeat more, gain more.
To study the full potential of tokens in DLM training, we launched an additional run in which the same 1B-token dataset was repeated for 480 epochs, yielding a total of 480B training tokens.
Notably, it achieves ~56% accuracy on HellaSwag and ~33% on MMLU, significantly outperforming AR’s ~41% and ~29%.
Surprisingly, even under such extreme repetition, performance did not saturate, suggesting that DLMs can extract substantially more signal from a fixed 1B-token corpus.
🧵 3/18
Models that “overfit” on the validation set keep improving on downstream tasks.
Why? 👇
🧵 4/18
We visualized the average negative log-likelihood (NLL) for the ground-truth and alternative options across multiple-choice evals, along with their differences (ΔNLL).
Even after "overfitting" on the validation set, the gap between ground-truth and alternative NLLs (ΔNLL) keeps widening, indicating that the model's underlying discriminative ability continues to improve despite the rising validation loss. This holds for both in-domain and out-of-domain training data.
🧵 5/18
Though robust to data repetition, DLMs do eventually overfit when trained for enough epochs.
A larger unique data size delays overfitting, while a larger model accelerates its onset.
🧵 6/18
So why are DLMs super data learners?
Reason 1/2
As the figure below shows, web text is not fully causal! It can be modeled in other directions, albeit at a higher loss. Modeling web data purely causally is therefore wasteful!
Bidirectional modeling, enabled by the diffusion objective and the bidirectional attention, extracts more signal from web data.
Reason 2/2
DLMs are super dense models. Their computational super-density (more FLOPs spent per task) translates directly into greater intelligence.
🧵 8/18
AR models prioritize compute efficiency over data potential.
Their transformer design, with teacher forcing and causal masking, maximizes GPU utilization but limits modeling capacity.
As compute becomes cheaper, data availability emerges as the key bottleneck, which motivates our study of DLMs.
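A minimal sketch of the architectural difference (illustrative, not the training code):

```python
import torch

T = 6  # sequence length

# AR: causal mask, so position i attends only to positions <= i.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

# DLM: bidirectional attention, so every position attends to every other;
# the training signal comes from masked-token denoising instead.
bidirectional = torch.ones(T, T, dtype=torch.bool)
```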
🧵 9/18
The diffusion objective explicitly requires each data point in the pre-training dataset to be corrupted at multiple masking ratios and mask combinations for effective training (to estimate the expectation more precisely). This offers another insight into why more data repetition brings so much gain.
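A minimal sketch of one such training step, assuming a linear masking schedule and a generic `model` (hypothetical names, not our implementation):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x, mask_id, vocab_size):
    B, T = x.shape
    # Sample a masking ratio t ~ U(0, 1) per sequence: each repetition of a
    # data point sees a different corruption level and mask combination.
    t = torch.rand(B, 1)
    is_masked = torch.rand(B, T) < t
    x_t = torch.where(is_masked, torch.full_like(x, mask_id), x)
    logits = model(x_t)  # (B, T, vocab_size)
    # Reconstruct only the masked positions; the 1/t weighting makes the
    # sum an estimate of the masked diffusion ELBO (MD4-style).
    nll = F.cross_entropy(
        logits.view(-1, vocab_size), x.view(-1), reduction="none"
    ).view(B, T)
    per_seq = (nll * is_masked).sum(1) / t.squeeze(1).clamp_min(1e-3) / T
    return per_seq.mean()
```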
🧵 10/18
Coincidentally, a concurrent study:
[1] Prabhudesai, Mihir, et al. "Diffusion Beats Autoregressive in Data-Constrained Settings." arXiv preprint arXiv:2507.15857 (2025).
explores similar topics. However, our careful analysis reveals several methodological issues that may lead to flawed conclusions.
Below we detail the “X rolling openreview”.
(Note: there’ve been some initial rounds of x rolling review out there :-) x.com/giffmana/statu…)
🧵 11/18
All experiments in [1] employ the loss function (1) without explicit justification. However, this loss differs significantly from the theoretically grounded and widely adopted masked diffusion language modeling loss (2).
In theory, we can prove that loss (1) does not faithfully represent the model likelihood (see Appendix H.4 of the MD4 paper, arxiv.org/abs/2406.04329, for a detailed discussion).
This may lead to serious problems in their conclusions. View more analysis in our blog post.
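To make the distinction concrete, here is our reading of the two objectives in MD4-style notation (paraphrased, not verbatim from [1]; t ~ U(0, 1), m is the mask token, x_t the corrupted sequence):

```latex
% (1) unweighted masked cross-entropy:
\mathcal{L}_1 = \mathbb{E}_{t,\,x_t}\Big[\textstyle\sum_{i:\,x_t^i = m} -\log p_\theta\big(x^i \mid x_t\big)\Big]

% (2) masked diffusion ELBO with the 1/t reweighting, a valid upper
%     bound on the negative log-likelihood:
\mathcal{L}_2 = \mathbb{E}_{t,\,x_t}\Big[\tfrac{1}{t}\textstyle\sum_{i:\,x_t^i = m} -\log p_\theta\big(x^i \mid x_t\big)\Big]
```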
> We notice that the authors modified the original draft to add a linear time-dependent reweighting in the latest arXiv submission (v3). We will nonetheless keep the assumption that all experiments used Equation 1, as the loss range in Figure 4(b) of [1] closely matches the behavior expected from Equation 1. We look forward to the release of the codebase (at the time of this post it's still an empty repo) and to replications from the community.
🧵 12/18
Is validation loss a good metric for AR and DLM comparison?
Short answer: when the loss formulation is problematic, certainly not, as the two losses do not represent the same quantity; and even when the loss formulation is correct, still not, because:
> One measures the exact negative log-likelihood, while the other is only an upper bound.
> A lower loss does not imply better capability, as evidenced by the discussion in 🧵 3/18 above.
🧵 13/18
The AR benchmark results reported in [1] are far from the best achievable. That is, [1] compares a premature AR checkpoint against the best diffusion checkpoint, which is unfair.
🧵 14/18
[1] compares overfitting trends between AR and diffusion models using a larger model size and a smaller set of unique training tokens for AR. This is an unfair setup: larger models trained on less diverse data are inherently prone to earlier overfitting.
🧵 15/18
The scaling law formulation used in [1] assumes a monotonically non-increasing validation loss, which fails in practice due to overfitting-induced loss increases.
This flawed assumption leads to poor fits and biases any conclusions derived from its predictions.
I strongly encourage the community to post more critical blogs as a more effective “open review”.
Especially now, as traditional conference reviews increasingly lose credibility, robust and transparent community feedback is crucial for advancing science—not just AI—toward healthier and more rigorous standards.
🧵 18/18
We are training a large model on a crazy setup and will release a full paper later. Stay tuned! 😉
The final models are still training, but the insights can't wait.
⚠️ Anyone interested in training a large diffusion MoE (>20B) is welcome to sponsor us!!!
✅ The code and data are ready, and we can get everything done in two weeks (depending on how many GPUs).
🫡
🧵 3/16
🤔 First of all, does diffusion + MoE even work?
We train four models from scratch: (1) a dense 1.7B; (2) a dense 8B; (3) and (4) two MoEs with 1.7B activated and 8B total parameters, one using expert-choice routing and one using token-choice routing.
As shown in the figure below, both MoE variants land between their FLOPs-matched and parameter-matched dense counterparts.
Notably, HellaSwag emphasizes commonsense reasoning, while MMLU is knowledge-intensive.
The results suggest that “moefying” is more closely tied to knowledge than to reasoning.
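For readers unfamiliar with the two routing schemes, a minimal sketch (illustrative shapes only, not our training code):

```python
import torch

def token_choice(scores, k=2):
    # scores: (tokens, experts). Each token picks its own top-k experts.
    return scores.topk(k, dim=1).indices          # (tokens, k)

def expert_choice(scores, capacity=4):
    # Each expert picks its top-`capacity` tokens instead.
    return scores.topk(capacity, dim=0).indices   # (capacity, experts)

scores = torch.randn(16, 8)        # 16 tokens routed across 8 experts
print(token_choice(scores).shape)  # torch.Size([16, 2])
print(expert_choice(scores).shape) # torch.Size([4, 8])
```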
To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size would you go for?
🐿️ We built Quokka to help you decide: the first-ever large-scale scaling law for DLMs.
Interesting facts:
1. Quokka is a good friend of Chinchilla, but bigger:
> Not only a compute-optimal scaling law, but also large-scale data-constrained ones.
> Not only scaling laws, but also sweeping the key modeling & optimization choices.
> Up to 11B model params, 260B tokens, 24000+ runs …
2. LLaDA is over-trained. For optimal performance it could have been 2× bigger and trained on 2× fewer tokens (see the quick arithmetic after this list).
3. You can train a 10B DLM on 1T unique tokens for ~1100 epochs and keep observing performance gains. Smell the AGI?
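Quick arithmetic behind fact 2, assuming the standard C ≈ 6ND FLOPs approximation:

```latex
C \approx 6ND \quad\Longrightarrow\quad 6\,(2N)\!\left(\tfrac{D}{2}\right) = 6ND = C
```

Doubling the parameters while halving the tokens keeps the compute budget unchanged, so the reallocation is free in FLOPs terms.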
As said, we also ablated many key modeling & optimization choices: transition kernels, diffusion schedules, curriculum strategies, loss formulation, learning rate, batch size, weight decay …
and got many interesting observations...
E.g., we can directly reuse AR models' optimal hyper-parameters for optimal DLM training!
🤔 To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size would you go for?
Note that you have limited compute, so a larger model means less data. There must be a tradeoff, right?
Quokka's compute-constrained scaling law is designed to answer exactly this question.
We use two approaches to cross-validate the scaling law coefficients:
> Approach 1: IsoFLOPs Profiles
> Approach 2: Fitting a Parametric Loss Function
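For Approach 2, we assume a Chinchilla-style parametric form (our notation; the exact parameterization is in the paper):

```latex
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing this under the budget constraint C ≈ 6ND yields the same Nₒₚₜ ∝ Cᵃ and Dₒₚₜ ∝ Cᵇ exponents, which can be cross-checked against Approach 1.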
🧵 2/24 Quokka’s compute-constrained scaling law
Approach 1: IsoFLOPs Profiles
In the first approach, we vary model size across nine fixed training FLOPs budgets, ranging from 3 × 10^18 to 1 × 10^21 FLOPs, and record the final training loss at each point.
This directly answers the question: 🤔 For a given FLOPs budget, what is the compute-optimal parameter count?
> In the left panel of the figure below, we observe a very clear performance valley for every single FLOPs budget: we can find the best param-token allocation!
> It's very intriguing that the optimal parameters and tokens have an essentially perfect linear relationship with FLOPs on log-log axes (middle and right panels)!
> That relationship between optimal parameters Nₒₚₜ, optimal data size Dₒₚₜ, and compute C can be written as Nₒₚₜ ∝ Cᵃ and Dₒₚₜ ∝ Cᵇ. Using Approach 1, we get a = 0.51 and b = 0.49.
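A minimal sketch of the Approach 1 fit (hypothetical data layout, not the actual pipeline):

```python
import numpy as np

def fit_isoflops(budgets, runs):
    """runs[C] is a list of (model_size_N, final_loss) pairs at fixed budget C.

    For each FLOPs budget C, fit a parabola to loss vs. log N and take its
    minimum as N_opt(C); then fit log N_opt = a * log C + const.
    """
    log_c, log_n_opt = [], []
    for c in budgets:
        n, loss = map(np.array, zip(*runs[c]))
        coef = np.polyfit(np.log(n), loss, deg=2)   # parabola in log N
        log_n_opt.append(-coef[1] / (2 * coef[0]))  # vertex = valley bottom
        log_c.append(np.log(c))
    a, intercept = np.polyfit(log_c, log_n_opt, deg=1)
    return a  # exponent in N_opt ∝ C^a (the thread reports a = 0.51)
```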