Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey.
Best fully open 32B reasoning model & best 32B base model. 🧵
Most models ship as a single opaque snapshot. Olmo 3 opens the model flow – pretraining, mid-training, & post-training – plus data recipes & code so you can see how capabilities are built + customize any stage.
Meet the Olmo 3 family:
🏗️ Olmo 3-Base (7B, 32B)—foundations for post-training with strong code, math, & reading comprehension skills
🛠️ Olmo 3-Instruct (7B)—multi-turn chat + tool use
🧠 Olmo 3-Think (7B, 32B)—“thinking” models that show their reasoning
All designed to run on hardware from laptops to research clusters.
At the center is Olmo 3-Think (32B)—a fully open 32B-scale reasoning model. We see 32B as a sweet spot: a large jump in reasoning over 7B, but still small enough for many users to fine-tune and study. 💡
We trained Olmo 3 on ~6T tokens from our new Dolma 3 pretraining dataset + new post-training sets featuring stronger data decontamination and richer math/code/reasoning mixes.
A long-context extension pushes Olmo 3’s context window to ~65K tokens (~48K words)—enough for full papers, books, & other long files.
Olmo 3 packs a punch. In our evals:
⦿ Olmo 3-Think (32B) is the strongest fully open 32B reasoner
⦿ Olmo 3-Base models beat fully open Marin & Apertus, rival Qwen 2.5 & Gemma 3
⦿ Olmo 3-Instruct (7B) ties or bests Qwen 2.5, Gemma 3 & Llama 3.1 on tough benchmarks
Rolling out alongside Olmo 3: a big Ai2 Playground upgrade ↴
🤔 Thinking mode—see intermediate reasoning on complex tasks
🧰 Tool calling—define JSON-schema tools or call tools in our Asta platform
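For the curious, a JSON-schema tool definition might look roughly like this. A minimal sketch in the common function-calling style; the `get_weather` tool, its fields, and the wrapper keys are illustrative, not the Playground's exact request format:

```python
# Hypothetical JSON-schema tool definition in the common function-calling style.
# The wrapper keys and the `get_weather` tool are illustrative, not the
# Playground's exact request format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Seattle'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```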
Olmo 3 is wired into OlmoTrace in the Ai2 Playground, so you don’t just see its behavior—you can trace it.
For example, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. 🧑‍🎓
If you care about AI that you can customize & improve, Olmo 3 is for you—available now under Apache 2.0.
Dive deep with Olmo leads Hanna Hajishirzi and Noah Smith on how & why we built Olmo 3, and what comes next.
Today we’re releasing Deep Research Tulu (DR Tulu)—the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭📚
Complex questions rarely have a straightforward answer. A deep research agent has to decide what to look up, read across many sources, figure out what actually matters, & then explain it clearly with citations—not just spit out a summary.
DR Tulu’s key idea: Reinforcement Learning with Evolving Rubrics (RLER)—rubric rewards that:
◈ Use instance-specific criteria grounded in retrieved knowledge
◈ Evolve with training to capture strategies & failure modes, reducing reward hacking. 📈
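In rough pseudocode, an evolving-rubric reward could be computed like this. This is our own sketch with hypothetical names (`judge`, `propose`, the `Rubric` format); see the DR Tulu paper for the actual recipe:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """One instance-specific criterion, e.g. 'cites a primary source for claim X'."""
    criterion: str
    weight: float = 1.0

@dataclass
class EvolvingRubricSet:
    rubrics: list[Rubric] = field(default_factory=list)

    def score(self, report: str, judge) -> float:
        # Reward = weighted fraction of rubric criteria the report satisfies,
        # as decided by an LLM grader (`judge`) checking against retrieved evidence.
        total = sum(r.weight for r in self.rubrics)
        met = sum(r.weight for r in self.rubrics if judge(report, r.criterion))
        return met / total if total else 0.0

    def evolve(self, recent_reports: list[str], propose):
        # Periodically ask a rubric-writer model (`propose`) for new criteria that
        # capture strategies or failure modes seen in recent rollouts -- this is
        # what discourages reward hacking against a frozen rubric.
        self.rubrics.extend(Rubric(c) for c in propose(recent_reports))
```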
Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.
By applying AI to a planet’s worth of data, the OlmoEarth Platform is already empowering communities to act faster & with confidence to secure a sustainable future. 🌲
OlmoEarth delivers intelligence to people on the ground for anything from aiding restoration efforts to protecting natural resources and communities.
Under the hood is our industry-leading OlmoEarth foundation model family—AI that fuses 🛰️ satellite imagery, 📡 radar, ⛰️ elevation, & 🗺️ detailed map layers.
Open, fast to adapt & deploy, with leading results on key benchmarks and in real-world applications for our partners.
Learn more → allenai.org/blog/olmoearth…
📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵
🧪 After analyzing 30 benchmarks & 465 open-weight models, the verdict is clear: a simple metric, signal-to-noise ratio (SNR), can reveal which benchmarks are actually informative for making decisions between two models.
📡 Signal = a benchmark’s ability to separate strong models from poor performers
📊 Noise = sensitivity to random variability between training steps
🔬 Benchmarks that can separate models and exhibit low noise during a model’s training are far more reliable for model eval
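In spirit, the metric looks something like this. Our paraphrase with hypothetical variable names; the paper's exact estimators may differ:

```python
import numpy as np

def snr(model_scores: np.ndarray, checkpoint_scores: np.ndarray) -> float:
    """Signal-to-noise ratio for one benchmark (a paraphrase of the paper's idea).

    model_scores:      final benchmark scores across many different models
    checkpoint_scores: one model's scores over its last few training checkpoints
    """
    signal = model_scores.max() - model_scores.min()  # can it separate models?
    noise = checkpoint_scores.std()                   # step-to-step jitter
    return signal / noise

# A benchmark with high SNR is more trustworthy for choosing between two models.
```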
Ever wonder how LLM developers choose their pretraining data? It’s not guesswork: all AI labs train small-scale experimental models to test candidate data, but the models and their data are rarely shared.
DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
🔮DataDecide measures how accurately small experiments (1B parameters, 100B tokens, 3 seeds) predict the real ranking of large runs. This helps us make the most cost-effective decisions for our training runs. 💸
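One way to make “how accurately small experiments predict the real ranking” concrete is pairwise decision accuracy. A hedged sketch in our own formulation, not necessarily DataDecide’s exact statistic:

```python
from itertools import combinations

def decision_accuracy(small: dict[str, float], large: dict[str, float]) -> float:
    """Fraction of dataset pairs where the small-scale experiment picks the same
    winner as the full-scale run (our formulation of prediction accuracy)."""
    pairs = list(combinations(small, 2))
    agree = sum(
        (small[a] > small[b]) == (large[a] > large[b]) for a, b in pairs
    )
    return agree / len(pairs) if pairs else 0.0

# e.g. decision_accuracy({"dataA": 0.41, "dataB": 0.38},
#                        {"dataA": 0.55, "dataB": 0.51})  -> 1.0 (rankings agree)
```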
🎯 Picking the right benchmark to hillclimb on makes all the difference.
Some tasks are much more sensitive at small scale:
💡 MMLU & ARC Easy give strong signals earlier than HellaSwag
🚫 Others stay hard to predict—even at larger scales
For years it’s been an open question — how much is a language model learning and synthesizing information, and how much is it just memorizing and reciting?
Introducing OLMoTrace, a new feature in the Ai2 Playground that begins to shed some light. 🔦
OLMoTrace connects phrases or even whole sentences in the language model’s output back to verbatim matches in its training data. It does this by searching billions of documents and trillions of tokens in real time and highlighting where it finds compelling matches.
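Conceptually, the matching step looks something like this toy sketch. Illustrative only: the real system answers these lookups in real time with pre-built indexes over trillions of tokens, while `in_corpus` here stands in for any membership check:

```python
def verbatim_spans(output_tokens: list[str], in_corpus, min_len: int = 6):
    """Toy version of verbatim-match tracing: find maximal spans of the model's
    output that appear exactly in the training corpus. `in_corpus` is any
    callable over a token tuple (a stand-in for a fast corpus index)."""
    spans, i = [], 0
    while i < len(output_tokens):
        j = i + min_len
        if j <= len(output_tokens) and in_corpus(tuple(output_tokens[i:j])):
            # Greedily extend the match as far as the corpus allows.
            while j < len(output_tokens) and in_corpus(tuple(output_tokens[i:j + 1])):
                j += 1
            spans.append((i, j))  # half-open token-index span to highlight
            i = j
        else:
            i += 1
    return spans
```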
OLMoTrace is useful for fact checking✅, understanding hallucinations🎃, tracing reasoning capabilities🧠, or just generally helping you see where an LLM’s response may have come from.
Meet Ai2 Paper Finder, an LLM-powered literature search system.
Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
Paper Finder breaks your search down into steps such as querying for papers, following citations, evaluating candidates for relevance, and running follow-up queries based on the results. It then presents not only the papers, but also short summaries of why each paper is relevant to your specific query.
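As an illustration of that workflow, here’s a rough sketch of the iterative loop. Entirely hypothetical function names and structure, not Paper Finder’s actual API:

```python
def paper_finder(query: str, search, follow_citations, judge_relevance, refine,
                 rounds: int = 3):
    """Illustrative iterative literature-search loop (not Paper Finder's real API)."""
    found = {}
    queries = [query]
    for _ in range(rounds):
        candidates = []
        for q in queries:
            candidates += search(q)                 # keyword / semantic search
        for paper in list(found.values()):
            candidates += follow_citations(paper)   # expand via the citation graph
        for paper in candidates:
            verdict = judge_relevance(paper, query) # LLM check + why-relevant blurb
            if verdict.relevant:
                found[paper.id] = paper
                paper.summary = verdict.explanation
        queries = refine(query, found.values())     # follow-up queries from results
    return list(found.values())
```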
Compared to other tools that focus on returning a few popular results, Paper Finder aims to cover the long tail of niche findings and hard-to-find papers that require an iterative search process. We believe this scope can better serve researchers who are experts in their fields. 🕵️