Stoked to get the 1B OLMo 2 model out -- this will likely be our most used model. Getting a 1B model we were happy with was a wandering path. I've written some lessons from training it below.
Astute followers of AI releases may be a bit confused about why we are releasing a 1B model as the last of our OLMo 2 releases. The first models dropped in November of 2024 and we let the 32B cook over the holidays when compute demand was lower — so why the heck a 1B now?
The 1B Instruct model is here and a GGUF for local users is here (all OLMo 2’s are here).
The reason is that we didn’t know our 1B base model was actually good enough. If you zoom in on the coming revision to the OLMo 2 paper you’ll see that the base model evaluations are largely “mid.” They’re decent relative to their peer models, but not as strong as the bigger models in the suite. We thought we had to keep pushing modeling decisions for a 1B model (such as fiddling with weight decay settings) or other things suited to small models — i.e. older techniques that don’t scale up to bigger models and so are out of fashion. Small models can be developed much differently than bigger ones.
This gap in how big vs. small models can be developed is part of the reason we suspect our final post-training results are strong compared to models like Gemma 3 1B or Llama 3.2 1B. Gemma 3 1B is the only model in its suite without a vision component and Llama 3.2 is a multimodal release — maybe these changes made their text-only performance weaker at the low end? We don’t quite know.
Full evals for the 1B models are attached.
You can see some things like formatting issues on DROP for Qwen or GSM8K for Gemma 3. These are the small details motivating the evaluation changes I’ll revisit later.
Turns out we had this 1B model sitting around for a while, and it was only when we tried more pretraining tricks that we compared the post-training numbers. The post-training numbers were far better than we expected and made the model best in class! We had been sitting on great results, and a model the community could love, without knowing it was actually good.
The biggest problem here is that we don’t know how well base model evaluations indicate a strong model for post-training. Trends seem to point to base model evals becoming the same as the evals used for post-training. Everything is about generating text well, and now generating chains of thought well. Qwen 3’s base model evaluations point in this direction.
With these evaluations maybe the only thing to care about is perplexity over controlled chunks of text — i.e. how well the next token prediction is working.
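For concreteness, that measurement is simple to run. Here is a minimal sketch using Hugging Face transformers; the checkpoint name is just a placeholder, and in practice you would average the loss over many fixed, held-out chunks of text rather than one string.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever base model you want to measure.
model_name = "allenai/OLMo-2-0425-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# A fixed, controlled chunk of text, used identically across models.
text = "Language models are trained to predict the next token in a sequence."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels returns the mean next-token cross-entropy over the chunk.
    loss = model(input_ids, labels=input_ids).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```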
If OLMo base models have a hard time competing in an era of mega compute, these insights will be our most valuable contributions. I’ve been pessimistic in the past about our ability to compete with the big players, but we just put out a 1B model competitive with very recent releases and we have been sitting on it for months!
The post-training for the 1B model again proved super robust. When you have a stable recipe, it works. We found the RL gains to be particularly robust — we mostly just had to let it keep running.
The biggest gap we’re trying to close now in post-training is a scalable reasoning recipe. If we want to release state of the art models on popular evaluations, scaling RL and inference-time compute is a requirement. We want to lead on problems like avoiding over-thinking, keeping reasoning usable, and so on, but we’ll see which innovations come first!
I’m personally feeling the big shift that all the leading AI labs have gone through in the last few months. Major changes in expectations come with major changes in tooling and processes. It’s exciting, but folks all over have been putting in serious effort to make that happen.
Let us know what you think of this 1B model. It’s been super fun to do mini research on and I suspect a lot of you will also like it for local inference tasks. What a great time to be in language modeling research.
And yes, you can make fun of us for the fact that our 1B model has 1.5B total parameters (1.3B without embedding parameters). We’ll focus on this more in the next versions — just one of those many things to get right.
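For the curious, the gap between those two counts is roughly the size of the embedding matrix. A back-of-the-envelope sketch, using ballpark values of a ~100k-token vocabulary and a 2048-dimensional hidden size (illustrative round numbers, not the exact config), and counting the embedding matrix once as with tied weights:

```python
# Rough parameter accounting for why a "1B" model reports 1.5B total params.
# Vocab and hidden size are illustrative ballpark numbers, not exact config values.
vocab_size = 100_000
hidden_dim = 2_048
non_embedding_params = 1.3e9  # the "1.3B without embedding parameters" figure

embedding_params = vocab_size * hidden_dim  # one embedding matrix, counted once
total_params = non_embedding_params + embedding_params

print(f"embedding params: {embedding_params / 1e9:.2f}B")  # ~0.20B
print(f"total params:     {total_params / 1e9:.2f}B")      # ~1.50B
```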
Okay okay, spent my weekend goofing around learning GRPO math. Here are some takes.
Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes, why, DAPO, connections to PPO, aggregating batches...
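As a starting point, here is a minimal sketch (my own plain-NumPy notation, not any particular codebase) of the piece of GRPO that most of these details hang off of: turning a group of rewards for one prompt into per-completion advantages, plus the change Dr. GRPO argues for (dropping the per-group standard-deviation normalization; it also drops length normalization in the loss, not shown here).

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: center each completion's reward on its group mean, divide by group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr. GRPO: keep the group-mean baseline but drop the std division,
    which otherwise up-weights prompts whose rewards barely vary."""
    return rewards - rewards.mean()

# Example: 4 completions sampled for one prompt, scored 0/1 by a verifier.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))     # roughly [ 1., -1., -1.,  1.]
print(dr_grpo_advantages(rewards))  # [ 0.5, -0.5, -0.5,  0.5]
```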
Reading list below.
More coherent version coming to @interconnectsai this week.
This is a very tidy little RL paper for reasoning. Their GRPO changes:
1) Two different clip hyperparams, so positive clipping can uplift more unexpected tokens
2) Dynamic sampling -- remove samples with flat reward in the batch
3) Per-token loss
4) Managing too-long generations in the loss
1) Clip Higher
Normally, PPO / GRPO have a single clipping hyperparameter. This change splits it into two, so the upper / positive log-ratio step size can be bigger. The goal is to make it easier to increase the probability of unlikely tokens, such as surprising new tokens in reasoning chains.
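A minimal sketch of what that looks like in a PPO/GRPO-style per-token loss (my own PyTorch pseudocode, not DAPO's actual implementation; the default epsilons are the values I recall from the paper, treat them as illustrative):

```python
import torch

def clipped_pg_loss(logprobs, old_logprobs, advantages,
                    eps_low: float = 0.2, eps_high: float = 0.28):
    """Per-token policy-gradient loss with asymmetric ("clip higher") clipping.

    All inputs are 1D tensors over tokens in the batch. Standard PPO/GRPO uses
    eps_low == eps_high; letting eps_high > eps_low allows larger upward moves,
    so low-probability (surprising) tokens can be boosted more per update.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) surrogate objective, averaged per token rather than
    # per sequence -- the "per-token loss" change in the list above.
    per_token = -torch.min(ratio * advantages, clipped_ratio * advantages)
    return per_token.mean()
```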
2) Dynamic Sampling
Remove uninformative samples from the batch. Essentially, in GRPO, if all samples for a prompt have the same reward there is no learning signal from that prompt. Removing them improves learning speed.
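A toy version of that filter (hypothetical data layout, one dict per prompt holding the rewards of its sampled completions):

```python
def drop_flat_reward_groups(groups):
    """Dynamic sampling: keep only prompts whose sampled completions got
    different rewards. If every completion scores the same (all right or all
    wrong), the group-relative advantages are all zero and contribute no
    gradient, so the prompt just takes up space in the batch."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

groups = [
    {"prompt": "too-easy question", "rewards": [1.0, 1.0, 1.0, 1.0]},  # dropped
    {"prompt": "useful question", "rewards": [1.0, 0.0, 0.0, 1.0]},    # kept
]
print([g["prompt"] for g in drop_flat_reward_groups(groups)])  # ['useful question']
```

The paper then keeps sampling additional prompts until the batch is refilled with informative groups, rather than training on a partially empty batch.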
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI.
RIP to people who say any of "progress is done," "scale is done," or "LLMs can't reason"
2024 was awesome. I love my job.
Also, look at the x-axes on some of these plots: o3 is at least partially a "scaled up" version of o1. There can be other advancements too.
@TheXeophon worried about getting access to the $1k/query model. Don't worry, I got u.
I've spent the last two years scouring all available resources on RLHF specifically and post training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier model post training recipe. We beat Llama 3.1 Instruct at 8B and 70B on the tasks we focused on.
So many things to share: New SFT data, recipes for scaling preference fine-tuning, a new RL optimization stage, extensive evaluation details, etc.
Right to the fun stuff. To finish our models, we use a new technique called Reinforcement Learning with Verifiable Rewards, where we train on math problems or prompts with constraints, and only reward the algorithm if the generation is correct. We find this improves performance after SFT and DPO.
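To make "only reward if the generation is correct" concrete: a verifiable reward is just a programmatic check on the output. A toy sketch below (my own simplified illustration, not the actual Tülu 3 verifiers):

```python
import re

def math_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 only if the completion's final number matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def length_constraint_reward(completion: str, max_words: int) -> float:
    """Return 1.0 only if the response satisfies a prompt constraint (here, length)."""
    return 1.0 if len(completion.split()) <= max_words else 0.0

print(math_reward("Adding them up, the total is 42.", "42"))       # 1.0
print(length_constraint_reward("A short, compliant answer.", 10))  # 1.0
```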