Jinjie Ni
Aug 9 · 19 tweets · 8 min read
Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks.
> No saturation: more repeats = more gains.

🚨 We also dissected the serious methodological flaws in our parallel work “Diffusion Beats Autoregressive in Data-Constrained Settings”. Let’s raise the bar for open review!

🔗 Blog & details:
jinjieni.notion.site/Diffusion-Lang…

18 🧵s ahead:
🧵 1/18

Diffusion language models are super data learners. 🧬

We pre-trained a series of DLMs from scratch for up to 8B parameters and 480B tokens.

These runs provide compelling evidence that, simply by repeating ordinary web data, DLMs outperform their AR counterparts across model sizes in data-constrained settings, demonstrating significantly greater potential without hitting performance saturation.

Overall, our results suggest DLMs exhibit more than threefold greater ultimate data potential compared to AR models.

Witness the intelligence crossovers 👇
🧵 2/18

Repeat more, gain more.

To study the full potential of tokens in DLM training, we launched an additional run in which the same 1B-token dataset was repeated for 480 epochs, yielding a total of 480B training tokens.

Notably, it achieves ~56% accuracy on HellaSwag and ~33% on MMLU, significantly outperforming AR’s ~41% and ~29%.

Surprisingly, even under such extreme repetition, performance did not saturate, suggesting that DLMs can extract substantially more signal from a fixed 1B-token corpus.
🧵 3/18

Models that “overfit” on the validation set keep improving on downstream tasks.

Why? 👇
🧵 4/18

We visualized the average negative log-likelihood (NLL) of the ground-truth and alternative options across multiple-choice evals, along with their respective differences (ΔNLL).

Even after "overfitting" on validation set, the gap between ground-truth and alternative NLLs (△NLL) continues to widen consistently, indicating that the model's underlying discriminative ability continues to improve despite the rise in validation loss. This phenomenon persists for both in-domain and out-of-domain training data.Image
🧵 5/18

Though robust to data repetition, DLMs do eventually overfit when trained for long enough.
Larger unique data sizes delay overfitting, while larger models accelerate its onset.
🧵 6/18

So why are DLMs super data learners?

Reason 1/2

As shown in the figure below, web text is not fully causal! It can also be modeled in other directions, albeit at a higher loss. This means that modeling web data purely causally is wasteful!

Bidirectional modeling, enabled by the diffusion objective and the bidirectional attention, extracts more signal from web data.

(figure adapted from arxiv.org/abs/2506.19935)
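As a recipe sketch (our own illustration, not the cited paper’s code), the probe behind such a figure can be as simple as training one architecture on differently ordered token streams; `train_lm` and `web_tokens` below are hypothetical placeholders.

```python
# Directionality probe sketch: same architecture, same data, different
# factorization order. Only `reorder` is concrete; the trainer is hypothetical.

def reorder(tokens: list[int], direction: str) -> list[int]:
    """Return the token stream under the chosen factorization order."""
    if direction == "forward":     # left-to-right: p(x) = Π_i p(x_i | x_<i)
        return tokens
    if direction == "backward":    # right-to-left: p(x) = Π_i p(x_i | x_>i)
        return tokens[::-1]
    raise ValueError(f"unknown direction: {direction}")

# for d in ("forward", "backward"):
#     loss = train_lm(reorder(web_tokens, d))   # hypothetical trainer
#     print(d, loss)  # per the figure: backward trains fine, at a higher loss
```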
🧵 7/18

Reason 2/2

DLMs are super dense models. Their computational super-density—more FLOPs per task—translates directly into greater intelligence.
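Back-of-envelope arithmetic (ours, using the standard ≈2N FLOPs-per-token-per-forward-pass rule of thumb; the step count S is an assumption) shows where the extra density comes from at inference time:

```python
# Rough FLOPs-per-task comparison; all numbers are illustrative assumptions.
N = 8e9    # parameters (8B, matching the largest models in this study)
L = 1024   # tokens to generate

# AR decoding with a KV cache: ~one incremental forward pass per new token.
ar_flops = 2 * N * L

# Masked-diffusion decoding: each denoising step is a full forward pass over
# all L positions; assume S steps (quality often wants S on the order of L).
S = L
dlm_flops = 2 * N * L * S

print(f"AR  ≈ {ar_flops:.2e} FLOPs")
print(f"DLM ≈ {dlm_flops:.2e} FLOPs ({dlm_flops / ar_flops:.0f}× more compute per task)")
```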
🧵 8/18

AR models prioritize compute efficiency over data potential.

Their transformer design—with teacher forcing and causal masking—maximizes GPU usage but limits modeling capacity.

As compute becomes cheaper, data availability emerges as the key bottleneck—motivating our study of DLMs.
🧵 9/18

The diffusion objective explicitly requires each data point in the pre-training dataset to be corrupted at multiple masking ratios and mask combinations for effective training (to estimate the expectation more precisely). This offers another insight into why more data repetitions bring so much gain.
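A minimal sketch of that re-corruption loop (our illustration; `MASK_ID` and the commented loss line are placeholders, not our training code):

```python
# Each epoch re-corrupts the same batch at a fresh masking ratio t and a
# fresh mask pattern, so a repeated corpus still yields new training views.
import torch

MASK_ID = 0  # placeholder mask-token id

def corrupt(x: torch.Tensor):
    """Mask each token independently with probability t ~ U(0, 1)."""
    t = torch.rand(x.size(0), 1)                       # one ratio per sequence
    mask = torch.rand(x.shape) < t                     # fresh pattern each call
    x_t = torch.where(mask, torch.full_like(x, MASK_ID), x)
    return x_t, mask, t

x = torch.randint(1, 50_000, (4, 128))                 # toy batch of token ids
for epoch in range(3):                                 # same data, new corruption
    x_t, mask, t = corrupt(x)
    print(f"epoch {epoch}: masked {mask.float().mean():.2%} of tokens")
    # loss = (ce(model(x_t), x) * mask / t).sum() / mask.numel()  # hypothetical
```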
🧵 10/18

Coincidentally, a concurrent study:

[1] Prabhudesai, Mihir, et al. "Diffusion Beats Autoregressive in Data-Constrained Settings." arXiv preprint arXiv:2507.15857 (2025).

explores similar topics. However, our careful analysis reveals several methodological issues that may lead to flawed conclusions.

Below we detail the “X rolling open review”.

(Note: there’ve already been some initial rounds of X rolling review out there :-) x.com/giffmana/statu…)
🧵 11/18

All experiments in [1] employ the loss function (1) without explicit justification. However, this loss differs significantly from the theoretically grounded and widely adopted masked diffusion language modeling loss (2).

In theory, we can prove that loss (1) does not faithfully represent the model likelihood (see H.4 of the MD4 paper, arxiv.org/abs/2406.04329, for a detailed discussion).

This may lead to serious problems in their conclusions. View more analysis in our blog post.

> We notice that the authors modified the original draft to add a linear time-dependent reweighting in the latest arXiv submission (v3). We will nonetheless keep the assumption that all experiments used Equation (1), as the loss range in Figure 4(b) of [1] closely matches the behavior expected from Equation (1). We look forward to the release of the codebase (at the time of this post it is still an empty repo) and to the relevant replications from the community.
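For readers without the image, here is our hedged reconstruction of the two objectives in our own notation. We are inferring the form of (1) from the reweighting note above, so treat this as an assumption rather than a quotation of [1]:

```latex
% Our reconstruction, not a quotation of [1]. x_0 is the clean sequence;
% x_t masks each token independently with probability t.
% (1) Unweighted masked cross-entropy (no time-dependent reweighting):
\mathcal{L}_1(\theta) = -\,\mathbb{E}_{t \sim U(0,1)}\,
  \mathbb{E}_{x_t \sim q(\cdot \mid x_0, t)}
  \sum_{i:\, x_t^i = \mathrm{[MASK]}} \log p_\theta\!\left(x_0^i \mid x_t\right)

% (2) Masked diffusion ELBO (MD4/MDLM-style): the 1/t weight is what makes
% the objective a valid bound on the negative log-likelihood.
\mathcal{L}_2(\theta) = -\,\mathbb{E}_{t \sim U(0,1)}\,\frac{1}{t}\,
  \mathbb{E}_{x_t \sim q(\cdot \mid x_0, t)}
  \sum_{i:\, x_t^i = \mathrm{[MASK]}} \log p_\theta\!\left(x_0^i \mid x_t\right)
```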
🧵 12/18

Is validation loss a good metric for AR and DLM comparison?

Short answer: when the loss formulation is problematic, certainly not, as the two losses do not represent the same thing; even when the loss formulation is correct, still not, because:

> One measures the exact negative log-likelihood, while the other is only an upper bound (see the note below).
> A lower loss does not imply better capability, as evidenced by the discussion in 🧵 3/18 above.
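The note referenced above, in our notation (these are standard facts about the two objectives, not results from either work):

```latex
% AR: the chain rule yields the exact negative log-likelihood.
-\log p_\theta^{\mathrm{AR}}(x) = -\sum_{i=1}^{L} \log p_\theta\!\left(x_i \mid x_{<i}\right)

% DLM: the diffusion training loss is an ELBO, i.e. only an upper bound,
% so equal loss values do not mean equal likelihoods.
-\log p_\theta^{\mathrm{DLM}}(x) \;\le\; \mathcal{L}_{\mathrm{ELBO}}(x)
```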
🧵 13/18

The AR benchmark results reported in [1] are far from the best attainable. That is, [1] compares a premature AR checkpoint with the best diffusion checkpoint, which is unfair.
🧵 14/18

[1] compares overfitting trends between AR and diffusion models using a larger model size and a smaller set of unique training tokens for AR—an unfair setup, as larger models trained on less diverse data are inherently prone to earlier overfitting.
🧵 15/18

The scaling law formulation used in [1] assumes a monotonically non-increasing validation loss, which fails in practice due to overfitting-induced loss increases.
This flawed assumption leads to poor fits and biases any conclusions derived from its predictions.
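To see why, a sketch assuming a Chinchilla-style parametric form (we do not know the exact functional form fitted in [1]):

```latex
% Chinchilla-style fit: monotonically non-increasing in data D by construction.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad \alpha, \beta > 0

% A validation-loss curve that turns upward once the model overfits repeated
% data cannot be represented by this family, so the fit and anything
% extrapolated from it are systematically biased.
```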
🧵 16/18

Once again, the blog link:

jinjieni.notion.site/Diffusion-Lang…
🧵 17/18

I strongly encourage the community to post more critical blogs as a more effective form of “open review”.
Especially now, as traditional conference reviews increasingly lose credibility, robust and transparent community feedback is crucial for advancing science, not just AI, toward healthier and more rigorous standards.
🧵 18/18

We are training a large model on a crazy setup and will release a full paper later. Stay tuned! 😉
