Nathan Lambert (@natolambert)
Jul 27
The dominance of Chinese open models is undersold in this post.

The top four open models being Chinese is one thing, but the full top 10 are models where pre- and post-training was done in house, and so are 18 of the top 20.

More from @natolambert

May 1
Stoked to get the 1B OLMo 2 model out -- this will likely be our most used model. Getting a 1B model we were happy with was a wandering path. I've written some lessons from training it below.

Astute followers of AI releases should be a bit confused by why we are releasing a 1B model as the last one of our releases with OLMo 2. The first models dropped in November of 2024 and we let the 32B cook over the holidays when compute demand was lower — why the heck a 1B now?

The 1B Instruct model is here and a GGUF for local users is here (all OLMo 2’s are here).

The reason is that we didn’t know our 1B base model was actually good enough. If you zoom in on the coming revision to the OLMo 2 paper you’ll see that the base model evaluations are largely “mid.” They’re decent relative to their peer models, but not as strong as the bigger models in the suite. We thought we had to keep pushing modeling decisions for a 1B model (such as fiddling with weight decay settings) or other things that are suitable for small models — i.e. older techniques that don’t scale up to bigger models and so are out of fashion. Small model development can be handled much differently than bigger models.

This gap in how big vs. small models can be developed is part of the reason we suspect our final post-training results are strong compared to models like Gemma 3 1B or Llama 3.2 1B. Gemma 3 1B is the only model in its suite without a vision component and Llama 3.2 1B is part of a multimodal release — maybe these changes made their text-only performance weaker at the low end? We don’t quite know.

Full evals for the 1B models are attached.

You can see some things like formatting issues on DROP for Qwen or GSM8K for Gemma 3. These are the small details motivating the evaluation changes I’ll revisit later.

Turns out we had this 1B model sitting around for a while and it was only when we tried more pretraining tricks that we compared the post-training numbers. The post-training numbers were far better than we expected and made the model best in class! We were sitting on great results and a model the community could love for a while without knowing it was actually good.

The biggest problem here is that we don’t know how base model evaluations indicate a strong model for post-training. Trends seem to point to base model evals being the same as the evals used for post-training. Everything is about generating text well, and now chains of thought well. Qwen 3’s base model evaluations can point at this.

With these evaluations maybe the only thing to care about is perplexity over controlled chunks of text — i.e. how well the next token prediction is working.
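Concretely, here is a minimal sketch of what "perplexity over a controlled chunk of text" means in practice, assuming a Hugging Face causal LM. The checkpoint name and the example text are placeholders, not the eval setup described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
model_name = "allenai/OLMo-2-0425-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average next-token cross-entropy) over the chunk."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # over next-token predictions (labels are shifted internally).
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(perplexity("Language models are trained to predict the next token."))
```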

If OLMo base models have a hard time competing in an era of mega compute, these insights will be our most valuable contributions. I’ve been pessimistic in the past about our ability to compete with the big players, but we just put out a 1B model competitive with very recent releases and we have been sitting on it for months!

The post-training for the 1B model again proved super robust. When you have a stable recipe, it works. We found the RL gains to be particularly robust — we mostly just had to let it keep running.

The biggest gap we’re trying to close now in post-training is a scalable reasoning recipe. If we want to release state of the art models on popular evaluations, scaling RL and inference-time compute is a requirement. We want to lead on problems like avoiding over-thinking, keeping reasoning usable, and so on, but we’ll see which innovations come first!

I’m personally feeling the big shift that all the leading AI labs have gone through in the last few months. Major changes in expectations come with major changes in tooling and processes. It’s exciting, but folks all over have been putting in serious effort to make that happen.

Let us know what you think of this 1B model. It’s been super fun to do mini research on and I suspect a lot of you will also like it for local inference tasks. What a great time to be in language modeling research.

And yes, you can make fun of us for the fact that our 1B model has 1.5B total parameters (1.3B without embedding parameters). We’ll focus on this more in the next versions — just one of those many things to get right.
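If you want to check total vs. non-embedding parameter counts on a checkpoint yourself, here is a rough sketch. The checkpoint name is just an example and the name-matching is my own heuristic, not an official script:

```python
from transformers import AutoModelForCausalLM

# Example checkpoint; swap in whichever model you want to inspect.
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B")

total = sum(p.numel() for p in model.parameters())
# Heuristic: treat token embeddings and the LM head as "embedding params".
embedding = sum(
    p.numel()
    for name, p in model.named_parameters()
    if "embed" in name or "lm_head" in name
)

print(f"total params:         {total / 1e9:.2f}B")
print(f"non-embedding params: {(total - embedding) / 1e9:.2f}B")
```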
recommend reading @vwxyzjn's thread on RL deets. Not a full reasoning model, but still super cool
Mar 24
Okay okay, spent my weekend gooning around learning GRPO math. Here's some takes.

Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes and why, DAPO, connections to PPO, aggregating batches...
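As a baseline for those details, here is a minimal sketch of the group-relative advantage at the heart of GRPO. This is my own toy numpy version, not any particular library's implementation:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, drop_std: bool = False) -> np.ndarray:
    """Group-relative advantages for one prompt's group of sampled completions.

    Standard GRPO normalizes by the group's std; Dr. GRPO argues for dropping
    that division (drop_std=True) to avoid biased updates on low-variance groups.
    """
    centered = rewards - rewards.mean()
    if drop_std:
        return centered
    return centered / (rewards.std() + 1e-8)

# Example: 4 completions for one prompt with verifiable 0/1 rewards.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))                  # GRPO-style
print(grpo_advantages(rewards, drop_std=True))   # Dr. GRPO-style
```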

Reading list below.
More coherent version coming to @interconnectsai this week.

I know this format won't be for everyone, but I hope some of you love it!
RLHF Book: rlhfbook.com/c/11-policy-gr…
DeepSeekMath paper: arxiv.org/pdf/2402.03300
Where does ratio come from in PPO? ai.stackexchange.com/questions/3795…
DAPO: arxiv.org/pdf/2503.14476
DAPO announcement: x.com/qiying_yu/stat…
My DAPO recap: x.com/natolambert/st…
Dr. GRPO: github.com/sail-sg/unders…
Dr. GRPO announcement: x.com/zzlccc/status/…
TRL GRPO implementation: github.com/huggingface/tr…
Unbiased GRPO implementation: github.com/sail-sg/oat/bl…
Thread on GRPO implementation on x: x.com/natolambert/st…
Mar 17
This is a very tidy little RL paper for reasoning. Their GRPO changes:
1) Two different clip hyperparameters, so positive clipping can uplift more unexpected tokens
2) Dynamic sampling -- remove samples with flat reward in the batch
3) Per-token loss
4) Managing too-long generations in the loss
1) Clip Higher
PPO / GRPO normally have a single clipping hyperparameter. This splits it into two, so the upper / positive log-ratio step size can be bigger. The point is to make it easier to increase the probability of tokens, such as surprising new tokens in reasoning chains.
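A rough sketch of what the asymmetric clipping looks like, assuming a standard PPO-style surrogate loss. This is a toy PyTorch version; the default epsilons follow the values reported in the DAPO paper, but treat them as illustrative:

```python
import torch

def clipped_pg_loss(logprobs, old_logprobs, advantages,
                    eps_low=0.2, eps_high=0.28):
    """Per-token clipped policy-gradient loss with separate lower/upper clip ranges.

    Standard PPO/GRPO uses eps_low == eps_high; "clip higher" uses a larger
    eps_high so positive-advantage (e.g. surprising) tokens can be pushed up more.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Maximize the pessimistic (min) objective -> minimize its negative.
    return -torch.minimum(unclipped, clipped).mean()
```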
2) Dynamic sampling: remove uninformative samples from the batch. Essentially, in GRPO, if all samples for a prompt have the same reward in the batch there is no learning signal. Removing them improves learning speed.
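A sketch of that filter in the same spirit (again my own toy version of the idea, not the paper's reference code):

```python
import numpy as np

def filter_uninformative_groups(groups):
    """Drop prompt groups whose rewards are all identical (zero advantage everywhere).

    `groups` is a list of dicts like
    {"prompt": ..., "completions": [...], "rewards": [...]}.
    The paper keeps sampling new prompts until the batch is filled
    with informative groups.
    """
    kept = []
    for g in groups:
        rewards = np.asarray(g["rewards"], dtype=float)
        if rewards.std() > 0:  # some completions succeeded, some failed
            kept.append(g)
    return kept
```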
Dec 20, 2024
o3 does not use different training or inference methods than o1 (at least in pro mode). No special "search".

OpenAI just found a hill and very quickly started hillclimbing it.

Excited to build an open-source one and prove this to you in 2025.
interconnects.ai/p/openais-o3-t…
@Miles_Brundage I am not psyoping myself again, just locking in for a wild ride.
very productive workday definitely wasn't distracted at all
Dec 20, 2024
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI.
RIP to people who say any of "progress is done," "scale is done," or "LLMs can't reason."
2024 was awesome. I love my job.
Also, look at the x-axes on some of these plots: o3 is at least partially a "scaled up" version of o1. There can be other advancements too.
@TheXeophon worried about getting access to the $1k/query model. Don't worry, I got u.
Nov 21, 2024
I've spent the last two years scouring all available resources on RLHF specifically and post-training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier-model post-training recipe. We beat Llama 3.1 Instruct at 8B and 70B on the tasks we focused on.

So many things to share: new SFT data, recipes for scaling preference fine-tuning, a new RL optimization stage, extensive evaluation details, etc.
First, try our models via our free demo, or grab them on Hugging Face.
My story behind it: buff.ly/494iXKg
8B model: buff.ly/498x15q
70B model: buff.ly/3Ok4PTp
Demo: buff.ly/492H2Rw
Website: buff.ly/4i0X2rm
Right to the fun stuff. To finish our models, we use a new technique called Reinforcement Learning with Verifiable Rewards (RLVR), where we train on math problems or prompts with constraints, and only reward the model if the generation is correct. We find this improves performance after SFT and DPO.
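For intuition, a verifiable reward can be as simple as a programmatic check against a reference answer. A toy sketch follows; it is my own illustration, not the actual Tülu 3 / open-instruct implementation, and the "answer is" extraction pattern is an assumed prompt convention:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the reference, else 0.0."""
    # Assume the model is prompted to end with "The answer is <number>."
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion.lower())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == reference_answer.strip() else 0.0

print(verifiable_reward("... so the answer is 42.", "42"))  # 1.0
print(verifiable_reward("I think it's 7.", "42"))           # 0.0
```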