Jinjie Ni
AI researcher building foundation models. I'm on the job market.
Oct 2
Announcing OpenMoE 2, the first-ever architectural study of sparse diffusion language models, trained from scratch.

✅ Expert-choice MoE × diffusion
✅ Ultra-wide FLOPs/param range (sparse → super-dense)
✅ Perfect load-balance (no aux loss)
✅ +20% throughput
✅ Adaptive computation

😯 In multi-epoch training, diffusion + MoE is a double win, whereas AR MoE collapses.
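For the "perfect load balance, no aux loss" point above: with expert-choice routing, each expert picks its own top tokens instead of tokens picking experts, so every expert is exactly at capacity by construction and tokens get a variable number of experts. A minimal sketch (illustrative shapes and names, not OpenMoE 2's actual code):

```python
import torch

def expert_choice_moe(x, w_router, experts, capacity):
    """Expert-choice routing sketch: each expert selects its own top-`capacity`
    tokens, so load balance is perfect by construction (no auxiliary loss) and
    different tokens receive a different number of experts (adaptive computation)."""
    scores = torch.softmax(x @ w_router, dim=-1)          # [tokens, experts]
    gates, token_idx = torch.topk(scores.t(), capacity)   # each expert picks its tokens
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        chosen = x[token_idx[e]]                          # [capacity, d_model]
        out[token_idx[e]] += gates[e].unsqueeze(-1) * expert(chosen)
    return out

# toy usage: 4 experts, each guaranteed exactly 8 tokens
d, n_exp = 32, 4
experts = [torch.nn.Linear(d, d) for _ in range(n_exp)]
x = torch.randn(64, d)
w_router = torch.randn(d, n_exp)
y = expert_choice_moe(x, w_router, experts, capacity=8)
```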

Blog:
jinjieni.notion.site/OpenMoE-2-Spar…

2/16

The final models are still training, but the insights can't wait.

⚠️ Anyone interested in training a large diffusion MoE (>20B) is welcome to sponsor us!!!

✅ The code and data are ready, and we can get everything done in two weeks (depending on how many GPUs we get).

🫡
Sep 30
🍷Imagine you are the boss of Google DeepMind.

To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size would you go for?

🐿️ We built Quokka to help you decide: the first-ever large-scale scaling law for DLMs.

Interesting facts:

1. Quokka is a good friend of Chinchilla, but it’s bigger:
> Not only compute-optimal scaling laws, but also large-scale data-constrained ones.
> Not only scaling laws, but also sweeping the key modeling & optimization choices.
> Up to 11B model params, 260B tokens, 24000+ runs …

2. LLaDA is over-trained. It could have been 2× bigger and trained on 2× fewer tokens for optimal performance (see the quick compute check after this list).

3. You can train a 10B DLM on 1T unique tokens for ~1100 epochs and keep observing performance gains. Smell the AGI?
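A quick sanity check on fact 2, using the standard C ≈ 6ND training-compute approximation (the numbers below are illustrative, not LLaDA's actual configuration):

```python
# Standard training-compute approximation: C ≈ 6 * N * D (params x tokens).
# Illustrative numbers only, not LLaDA's actual configuration.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

base = train_flops(8e9, 2e12)              # a hypothetical 8B model on 2T tokens
rescaled = train_flops(2 * 8e9, 2e12 / 2)  # 2x the params, 2x fewer tokens
assert base == rescaled                    # same compute budget, different N/D split
```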

As said, we also ablated many key modeling & optimization choices: transition kernels, diffusion schedules, curriculum strategies, loss formulation, learning rate, batch size, weight decay …

and got many interesting observations...
E.g., we can directly use the optimal hyper-parameters from AR models for optimal DLM training!

Find them all in the paper:

> Paper (main url): jinjieni.github.io/Quokka/resourc…
> Paper (backup url): gitee.com/JinjieNi/quokk…
> GitHub: github.com/JinjieNi/Quokka

24🧵s ahead:

🧵 1/24 Quokka’s compute-constrained scaling law

Let’s get back to that question:

🤔 To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size would you go for?

Note that you have limited compute, so a larger model means less data. There must be a tradeoff, right?

Quokka’s compute-constrained scaling law is designed to solve this question.
We use two approaches to cross-validate the scaling law coefficients:

> Approach 1: IsoFLOPs Profiles
> Approach 2: Fitting a Parametric Loss Function
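For intuition on Approach 2, here is a minimal Python sketch of fitting a Chinchilla-style parametric loss surface and reading off the compute-optimal model size. The functional form L(N, D) = E + A/N^α + B/D^β is the standard one; all numbers and coefficients below are synthetic, not Quokka's actual estimates, and the fitting procedure is simplified.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss surface: L(N, D) = E + A / N**alpha + B / D**beta.
def loss_surface(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Pretend these (params, tokens, loss) points came from a sweep of training runs.
rng = np.random.default_rng(0)
Ns = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
Ds = np.array([2e9, 6e9, 2e10, 6e10, 2e11])
N, D = (a.ravel() for a in np.meshgrid(Ns, Ds))
L = loss_surface((N, D), 1.7, 400.0, 0.34, 900.0, 0.28) + rng.normal(0, 1e-3, N.shape)

popt, _ = curve_fit(loss_surface, (N, D), L,
                    p0=[1.5, 100.0, 0.3, 100.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = popt

# Minimizing L under a fixed budget C = 6*N*D gives
#   N_opt = G * (C/6)**(beta/(alpha+beta)),  G = (alpha*A / (beta*B))**(1/(alpha+beta)).
C = 1e23  # made-up FLOPs budget
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
N_opt = G * (C / 6) ** (beta / (alpha + beta))
D_opt = (C / 6) / N_opt
print(f"N_opt ≈ {N_opt:.2e} params, D_opt ≈ {D_opt:.2e} tokens")
```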
Aug 9
Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks.
> No saturation: more repeats = more gains.

🚨 x.openreview.net
We also dissected the serious methodological flaws in our parallel work “Diffusion Beats Autoregressive in Data-Constrained Settings” — let’s raise the bar for open review!

🔗 Blog & details:
jinjieni.notion.site/Diffusion-Lang…

18 🧵s ahead:

🧵 1/18

Diffusion language models are super data learners. 🧬

We pre-trained a series of DLMs from scratch for up to 8B parameters and 480B tokens.

The results provide compelling evidence that, by repeating normal web data, DLMs outperform their AR counterparts across model sizes in data-constrained settings, showing significantly greater potential without hitting performance saturation.

Overall, our results suggest DLMs exhibit more than threefold greater ultimate data potential compared to AR models.
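For reference, here is a minimal sketch of the two pre-training objectives being compared: the standard AR next-token loss versus a LLaDA-style masked-diffusion loss. The code is illustrative (toy model, hypothetical mask id), not the actual training implementation; note how the diffusion loss draws a fresh corruption of the same data on every pass.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id

def ar_loss(model, tokens):
    """Standard autoregressive objective: predict token i+1 from tokens <= i."""
    logits = model(tokens[:, :-1])                         # [B, L-1, V]
    return F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])

def masked_diffusion_loss(model, tokens):
    """Masked-diffusion objective (sketch): sample a masking ratio t, mask
    positions independently, predict the masked tokens, reweight by 1/t."""
    B, L = tokens.shape
    t = torch.rand(B, 1).clamp(min=1e-3)                   # per-sequence masking ratio
    masked = torch.rand(B, L) < t
    noisy = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noisy)                                  # [B, L, V]
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return (ce * masked / t).sum() / masked.sum().clamp(min=1)

# toy usage with a dummy embedding -> linear "model"
vocab, d = 100, 32
emb, head = torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab)
model = lambda ids: head(emb(ids))
tokens = torch.randint(1, vocab, (4, 16))
print(ar_loss(model, tokens).item(), masked_diffusion_loss(model, tokens).item())
```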

Witness the intelligence crossovers 👇