Stoked to get the 1B OLMo 2 model out -- this will likely be our most used model. Getting a 1B model we were happy with was a wandering path. I've written some lessons from training it below.
Astute followers of AI releases may be a bit confused about why we are releasing a 1B model as the last of our OLMo 2 releases. The first models dropped in November of 2024 and we let the 32B cook over the holidays when compute demand was lower — so why the heck a 1B now?
The 1B Instruct model is here and a GGUF for local users is here (all OLMo 2’s are here).
The reason is that we didn’t know our 1B base model was actually good enough. If you zoom in on the coming revision to the OLMo 2 paper you’ll see that the base model evaluations are largely “mid.” They’re decent relative to their peer models, but not as strong as the bigger models in the suite. We thought we had to keep pushing modeling decisions for a 1B model (such as fiddling with weight decay settings) or other things suited to small models — i.e. older techniques that don’t scale up to bigger models and so are out of fashion. Small models can be developed much differently than bigger ones.
This gap in how big vs. small models can be developed is part of the reason we suspect our final post-training results are strong compared to models like Gemma 3 1B or Llama 3.2 1B. Gemma 3 1B is the only model in its suite without a vision component and Llama 3.2 is a multimodal release — maybe these changes made their text-only performance weaker at the low end? We don’t quite know.
Full evals for the 1B models are attached.
You can see some things like formatting issues on DROP for Qwen or GSM8K for Gemma 3. These are the small details motivating the evaluation changes I’ll revisit later.
Turns out we had this 1B model sitting around for a while, and it was only when we tried more pretraining tricks that we compared the post-training numbers. The post-training numbers were far better than we expected and made the model best in class! We had been sitting on great results, and a model the community could love, without knowing it was actually good.
The biggest problem here is that we don’t know how well base model evaluations indicate a strong model for post-training. Trends seem to point to base model evals becoming the same as the evals used for post-training. Everything is about generating text well, and now generating chains of thought well. Qwen 3’s base model evaluations point in this direction.
With these evaluations maybe the only thing to care about is perplexity over controlled chunks of text — i.e. how well the next token prediction is working.
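For concreteness, that measurement is simple to run. Here is a minimal sketch using Hugging Face transformers; the checkpoint name is just a placeholder, and in practice you would average the loss over many fixed, held-out chunks of text rather than one string.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever base model you want to measure.
model_name = "allenai/OLMo-2-0425-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# A fixed, controlled chunk of text, used identically across models.
text = "Language models are trained to predict the next token in a sequence."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels returns the mean next-token cross-entropy over the chunk.
    loss = model(input_ids, labels=input_ids).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```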
If OLMo base models have a hard time competing in an era of mega compute, these insights will be our most valuable contributions. I’ve been pessimistic in the past about our ability to compete with the big players, but we just put out a 1B model competitive with very recent releases and we have been sitting on it for months!
The post-training for the 1B model again proved super robust. When you have a stable recipe, it works. We found the RL gains to be particularly robust — we mostly just had to let it keep running.
The biggest gap we’re trying to close now in post-training is a scalable reasoning recipe. If we want to release state of the art models on popular evaluations, scaling RL and inference-time compute is a requirement. We want to lead on problems like avoiding over-thinking, keeping reasoning usable, and so on, but we’ll see which innovations come first!
I’m personally feeling the big shift that all the leading AI labs have gone through in the last few months. Major changes in expectations come with major changes in tooling and processes. It’s exciting, but folks all over have been putting in serious effort to make that happen.
Let us know what you think of this 1B model. It’s been super fun to do mini research on and I suspect a lot of you will also like it for local inference tasks. What a great time to be in language modeling research.
And yes, you can make fun of us for the fact that our 1B model has 1.5B total parameters (1.3B without embedding parameters). We’ll focus on this more in the next versions — just one of those many things to get right.
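For the curious, the gap between those two counts is roughly the size of the embedding matrix. A back-of-the-envelope sketch, using ballpark values of a ~100k-token vocabulary and a 2048-dimensional hidden size (illustrative round numbers, not the exact config), and counting the embedding matrix once as with tied weights:

```python
# Rough parameter accounting for why a "1B" model reports 1.5B total params.
# Vocab and hidden size are illustrative ballpark numbers, not exact config values.
vocab_size = 100_000
hidden_dim = 2_048
non_embedding_params = 1.3e9  # the "1.3B without embedding parameters" figure

embedding_params = vocab_size * hidden_dim  # one embedding matrix, counted once
total_params = non_embedding_params + embedding_params

print(f"embedding params: {embedding_params / 1e9:.2f}B")  # ~0.20B
print(f"total params:     {total_params / 1e9:.2f}B")      # ~1.50B
```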
Okay okay, spent my weekend goofing around learning GRPO math. Here are some takes.
Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes, why, DAPO, connections to PPO, aggregating batches...
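As a starting point, here is a minimal sketch (my own plain-NumPy notation, not any particular codebase) of the piece of GRPO that most of these details hang off of: turning a group of rewards for one prompt into per-completion advantages, plus the change Dr. GRPO argues for (dropping the per-group standard-deviation normalization; it also drops length normalization in the loss, not shown here).

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: center each completion's reward on its group mean, divide by group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr. GRPO: keep the group-mean baseline but drop the std division,
    which otherwise up-weights prompts whose rewards barely vary."""
    return rewards - rewards.mean()

# Example: 4 completions sampled for one prompt, scored 0/1 by a verifier.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))     # roughly [ 1., -1., -1.,  1.]
print(dr_grpo_advantages(rewards))  # [ 0.5, -0.5, -0.5,  0.5]
```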
Reading list below.
More coherent version coming to @interconnectsai this week.
This is a very tidy little RL paper for reasoning. Their GRPO changes:
1) Two different clip hyperparams, so positive clipping can uplift more unexpected tokens
2) Dynamic sampling -- remove samples with flat reward in the batch
3) Per-token loss
4) Managing too-long generations in the loss
1) Clip Higher
Normally, PPO / GRPO have a single clipping hyperparameter. This change splits it into two, so the upper / positive log-ratio step size can be bigger. The goal is to make it easier to increase the probability of unlikely tokens, such as surprising new tokens in reasoning chains.
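A minimal sketch of what that looks like in a PPO/GRPO-style per-token loss (my own PyTorch pseudocode, not DAPO's actual implementation; the default epsilons are the values I recall from the paper, treat them as illustrative):

```python
import torch

def clipped_pg_loss(logprobs, old_logprobs, advantages,
                    eps_low: float = 0.2, eps_high: float = 0.28):
    """Per-token policy-gradient loss with asymmetric ("clip higher") clipping.

    All inputs are 1D tensors over tokens in the batch. Standard PPO/GRPO uses
    eps_low == eps_high; letting eps_high > eps_low allows larger upward moves,
    so low-probability (surprising) tokens can be boosted more per update.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) surrogate objective, averaged per token rather than
    # per sequence -- the "per-token loss" change in the list above.
    per_token = -torch.min(ratio * advantages, clipped_ratio * advantages)
    return per_token.mean()
```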
2) Dynamic Sampling
Remove uninformative samples from the batch. Essentially, in GRPO, if all samples for a prompt have the same reward there is no learning signal from that prompt. Removing them improves learning speed.
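A toy version of that filter (hypothetical data layout, one dict per prompt holding the rewards of its sampled completions):

```python
def drop_flat_reward_groups(groups):
    """Dynamic sampling: keep only prompts whose sampled completions got
    different rewards. If every completion scores the same (all right or all
    wrong), the group-relative advantages are all zero and contribute no
    gradient, so the prompt just takes up space in the batch."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

groups = [
    {"prompt": "too-easy question", "rewards": [1.0, 1.0, 1.0, 1.0]},  # dropped
    {"prompt": "useful question", "rewards": [1.0, 0.0, 0.0, 1.0]},    # kept
]
print([g["prompt"] for g in drop_flat_reward_groups(groups)])  # ['useful question']
```

The paper then keeps sampling additional prompts until the batch is refilled with informative groups, rather than training on a partially empty batch.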
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI.
RIP to people who say any of "progress is done," "scale is done," or "LLMs can't reason"
2024 was awesome. I love my job.
Also, look at the x-axes on some of these plots: o3 is at least partially a "scaled up" version of o1. There can be other advancements too.
@TheXeophon worried about getting access to the $1k/query model. Don't worry, I got u.
I've spent the last two years scouring all available resources on RLHF specifically and post training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier model post training recipe. We beat Llama 3.1 Instruct at 8B and 70B on the tasks we focused on.
So many things to share: New SFT data, recipes for scaling preference fine-tuning, a new RL optimization stage, extensive evaluation details, etc.
Right to the fun stuff. To finish our models, we use a new technique called Reinforcement Learning with Verifiable Rewards, where we train on math problems or prompts with constraints, and only reward the algorithm if the generation is correct. We find this improves performance after SFT and DPO.
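To make "only reward if the generation is correct" concrete: a verifiable reward is just a programmatic check on the output. A toy sketch below (my own simplified illustration, not the actual Tülu 3 verifiers):

```python
import re

def math_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 only if the completion's final number matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def length_constraint_reward(completion: str, max_words: int) -> float:
    """Return 1.0 only if the response satisfies a prompt constraint (here, length)."""
    return 1.0 if len(completion.split()) <= max_words else 0.0

print(math_reward("Adding them up, the total is 42.", "42"))       # 1.0
print(length_constraint_reward("A short, compliant answer.", 10))  # 1.0
```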