Figuring out AI @allen_ai, open models, RLHF, fine-tuning, etc
Contact via email.
Writes @interconnectsai
Wrote The RLHF Book
Mountain runner
May 1 • 4 tweets • 4 min read
Stoked to get the 1B OLMo 2 model out -- this will likely be our most used model. Getting a 1B model we were happy with was a wandering path. I've written some lessons from training it below.
Astute followers of AI releases should be a bit confused about why we are releasing a 1B model as the last of our OLMo 2 releases. The first models dropped in November of 2024 and we let the 32B cook over the holidays when compute demand was lower, so why the heck a 1B now?
The 1B Instruct model is here and a GGUF for local users is here (all OLMo 2’s are here).
The reason is that we didn’t know our 1B base model was actually good enough. If you zoom in on the coming revision to the OLMo 2 paper you’ll see that the base model evaluations are largely “mid.” They’re decent relative to their peer models, but not as strong as the bigger models in the suite. We thought we had to keep pushing modeling decisions for a 1B model (such as fiddling with weight decay settings) or other techniques suited to small models, i.e. older tricks that don’t scale up to bigger models and so are out of fashion. Small model development can be handled much differently than bigger models.
This gap in how big vs. small models can be developed is part of the reason we suspect our final post-training results are strong compared to models like Gemma 3 1B or Llama 3.2 1B. Gemma 3 1B is the only model in its suite without a vision component and Llama 3.2 is a multimodal release; maybe these changes made their text-only performance weaker at the low end? We don’t quite know.
Full evals for the 1B models are attached.
You can see some things like formatting issues on DROP for Qwen or GSM8K for Gemma 3. These are the small details motivating the evaluation changes I’ll revisit later.
Turns out we had this 1B model sitting around for a while and it was only when we tried more pretraining tricks that we compared the post-training numbers. The post-training numbers were far better than we expected and made the model best in class! We were sitting on great results and a model the community could love for a while without knowing it was actually good.
The biggest problem here is that we don’t know which base model evaluations indicate a model will post-train well. Trends seem to point to base model evals converging with the evals used for post-training. Everything is about generating text well, and now generating chains of thought well. Qwen 3’s base model evaluations point in this direction.
With these evaluations maybe the only thing to care about is perplexity over controlled chunks of text — i.e. how well the next token prediction is working.
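If you want to poke at this yourself, here is a rough sketch of that kind of perplexity check with Hugging Face transformers. The model id is a placeholder (double-check the exact OLMo 2 repo name on the Hub), and the text chunk is whatever held-out data you want to control:

```python
# Minimal sketch: perplexity of a causal LM over a fixed chunk of text.
# The model id is a placeholder; swap in whichever base model you want to compare.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-0425-1B"  # placeholder repo id, verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Some held-out chunk of text to score."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # labels=input_ids returns the mean next-token cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print("perplexity:", torch.exp(loss).item())
```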
If OLMo base models have a hard time competing in an era of mega compute, these insights will be our most valuable contributions. I’ve been pessimistic in the past about our ability to compete with the big players, but we just put out a 1B model competitive with very recent releases and we have been sitting on it for months!
The post-training for the 1B model again proved super robust. When you have a stable recipe, it works. We found the RL gains to be particularly robust — we mostly just had to let it keep running.
The biggest gap we’re trying to close now in post-training is a scalable reasoning recipe. If we want to release state of the art models on popular evaluations, scaling RL and inference-time compute is a requirement. We want to lead on problems like avoiding over-thinking, keeping reasoning usable, and so on, but we’ll see which innovations come first!
I’m personally feeling the big shift that all the leading AI labs have gone through in the last few months. Major changes in expectations come with major changes in tooling and processes to keep up. It’s exciting, but folks all over have been putting in serious effort to make that happen.
Let us know what you think of this 1B model. It’s been super fun to do mini research on and I suspect a lot of you will also like it for local inference tasks. What a great time to be in language modeling research.
And yes, you can make fun of us for the fact that our 1B model has 1.5B total parameters (1.3B without embedding parameters). We’ll focus on this more in the next versions — just one of those many things to get right.
Micro blog form: natolambert.substack.com/p/in-between-t…
Mar 24 • 5 tweets • 4 min read
Okay okay, spent my weekend gooning around learning GRPO math. Here's some takes.
Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes, why, DAPO, connections to PPO, aggregating batches...
Reading list below.
More coherent version coming to @interconnectsai this week.
This is a very tidy little RL paper for reasoning. Their GRPO changes:
1) Two different clip hyperparams, so positive clipping can uplift more unexpected tokens
2) Dynamic sampling -- remove samples with flat reward in a batch
3) Per-token loss
4) Managing too-long generations in the loss
1) Clip Higher
Normally, PPO / GRPO have a single clipping hyperparam. This moves to two hypers, so the upper / positive log-ratio step size can be bigger. The goal is to let the policy boost the probability of unlikely tokens more aggressively, such as surprising new tokens in reasoning chains.
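Roughly, the loss ends up looking something like this sketch, which also folds in the per-token aggregation change. Tensor names and the clip values are illustrative, not a definitive implementation:

```python
# Sketch of a PPO/GRPO-style clipped objective with separate low/high clip ranges
# ("clip higher") and per-token loss aggregation. Tensor names are illustrative.
import torch

def clipped_policy_loss(logprobs_new, logprobs_old, advantages, mask,
                        eps_low=0.2, eps_high=0.28):
    # per-token ratio of new vs. old policy probabilities
    ratio = torch.exp(logprobs_new - logprobs_old)
    # asymmetric clipping: a larger eps_high lets positive-advantage (surprising)
    # tokens be upweighted more than eps_low allows on the downside
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # per-token aggregation: average over all valid tokens in the batch,
    # rather than averaging within each sequence first
    return (per_token * mask).sum() / mask.sum()
```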
Dec 20, 2024 • 6 tweets • 1 min read
o3 does not use different training or inference methods than o1 (at least in pro mode). No special "search".
OpenAI just found a hill and very quickly started hillclimbing it.
Excited to build an open-source one and prove this to you in 2025. interconnects.ai/p/openais-o3-t…
@Miles_Brundage I am not psyoping myself again, just locking in for a wild ride.
Dec 20, 2024 • 4 tweets • 1 min read
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI.
Rip to people who say any of "progress is done," "scale is done," or "llms cant reason"
2024 was awesome. I love my job.
Also, look at the x axes on some of these plots; o3 is at least partially a "scaled up" version of o1. There can be other advancements too.
Nov 21, 2024 • 8 tweets • 5 min read
I've spent the last two years scouring all available resources on RLHF specifically and post training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier model post training recipe. We beat Llama 3.1 Instruct at 8B and 70B on the tasks we focused on.
So many things to share: New SFT data, recipes for scaling preference fine-tuning, a new RL optimization stage, extensive evaluation details, etc.
First, try our models via our free demo, or grab them on Hugging Face.
My story behind it: buff.ly/494iXKg
8B model: buff.ly/498x15q
70B model: buff.ly/3Ok4PTp
Demo: buff.ly/492H2Rw
Website: buff.ly/4i0X2rm
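If you want a quick start, here is a rough loading sketch with transformers. The repo id is a guess from the release naming, so double-check the exact name on Hugging Face:

```python
# Hedged sketch: load the 8B Tulu 3 model and run one chat turn.
# The repo id is assumed from the release naming; verify it on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-8B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Give me one fun fact about mountain running."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```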
May 9, 2023 • 5 tweets • 3 min read
Here is the alpha version of our coding assistant: StarChat. 🌟💬
This is a quick instruction-tune of @BigCodeProject 's StarCoder model. Next we'll improve usability with RLHF.
We include a lot in this blog post, including:
* how to train the 16-billion-parameter model with DeepSpeed (a rough config sketch follows after this list)
* running ChatGPT/GPT-4 evaluations and their limitations
* evaluation of this instruction-tuned model vs. the base model
* more!
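For the DeepSpeed piece, the shape of a ZeRO-3 config looks roughly like this; the values are illustrative, not the exact setup from the blog post:

```python
# Hedged sketch: a ZeRO-3 DeepSpeed config of the kind you might pass to the HF Trainer
# when fine-tuning a ~16B model. Values are illustrative placeholders.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu"},  # spill optimizer state to CPU RAM if needed
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
# With transformers, this dict can be passed via TrainingArguments(deepspeed=ds_config).
```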
Oct 24, 2022 • 21 tweets • 3 min read
Open problems in RL research thread (infrastructure, methodological, philosophical, ...).
Why does it feel like the field is expanding horizontally (more research, more general progress) without breakthroughs in academia (we're still using the same 3+ year old algorithms)?
Add some⬇️
2
environments are hard to set up and share.
every grad student has war stories of dealing with mujoco.
now past research code depends on unmaintained mujoco versions.
how many years did this lose?
Mar 9, 2021 • 9 tweets • 4 min read
Ever wonder what the limits of current Deep RL algorithms are with better hyperparameter tuning?
The answer (with model-based RL): way better than we thought. It breaks the simulator.
1/⬇️ presented at @aistats_conf 2021
The work, "On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning" shows that model-based RL can achieve literal god mode performance with the right parameter tuning.