Ai2 · Nov 20
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey.
Best fully open 32B reasoning model & best 32B base model. 🧵
Most models ship as a single opaque snapshot. Olmo 3 opens the model flow – pretraining, mid-training, & post-training – plus data recipes & code so you can see how capabilities are built + customize any stage.
Meet the Olmo 3 family:
🏗️ Olmo 3-Base (7B, 32B)—foundations for post-training with strong code, math, & reading comprehension skills
🛠️ Olmo 3-Instruct (7B)—multi-turn chat + tool use
🧠 Olmo 3-Think (7B, 32B)—“thinking” models that show their reasoning
All designed to run on hardware from laptops to research clusters.
At the center is Olmo 3-Think (32B)—a fully open 32B-scale reasoning model. We see 32B as a sweet spot: a large jump in reasoning over 7B, but still small enough for many users to fine-tune and study. 💡
We trained Olmo 3 on ~6T tokens from our new Dolma 3 pretraining dataset + new post-training sets featuring stronger data decontamination and richer math/code/reasoning mixes.
A long-context extension pushes Olmo 3’s context window to ~65K tokens (~48K words)—enough for full papers, books, & other long files.
Olmo 3 packs a punch. In our evals:
⦿ Olmo 3-Think (32B) is the strongest fully open 32B reasoner
⦿ Olmo 3-Base models beat fully open Marin & Apertus, rival Qwen 2.5 & Gemma 3
⦿ Olmo 3-Instruct (7B) ties or bests Qwen 2.5, Gemma 3 & Llama 3.1 on tough benchmarks
Rolling out alongside Olmo 3: a big Ai2 Playground upgrade ↴
🤔 Thinking mode—see intermediate reasoning on complex tasks
🧰 Tool calling—define JSON-schema tools or call tools in our Asta platform
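For reference, a minimal sketch of a JSON-schema tool definition in the common OpenAI-style format; the exact shape the Playground and Asta expect may differ, and get_weather is a hypothetical example tool, not part of the platform.

```python
# Minimal JSON-schema tool definition in the common OpenAI-style format.
# The exact schema the Ai2 Playground / Asta expects may differ;
# "get_weather" is a hypothetical example tool.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Seattle'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```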
Olmo 3 is wired into OlmoTrace in the Ai2 Playground, so you don’t just see its behavior—you can trace it.
For example, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. 🧑‍🎓
If you care about AI that you can customize & improve, Olmo 3 is for you—available now under Apache 2.0.
Go deeper with Olmo leads Hanna Hajishirzi and Noah Smith on how & why we built Olmo 3, and what comes next.
✨ Try Olmo 3 in the Ai2 Playground → playground.allenai.org/?utm_source=x&… & our Discord → discord.gg/ai2
💻 Download: huggingface.co/collections/al…
📝 Blog: allenai.org/blog/olmo3?utm…
📚 Technical report: allenai.org/papers/olmo3?u…
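For the code-inclined: a minimal sketch of loading an Olmo 3 checkpoint with Hugging Face transformers. The model ID below is an assumption based on the collection naming; check the Hugging Face collection linked above for the exact repo names.

```python
# Minimal sketch: load an Olmo 3 checkpoint with Hugging Face transformers.
# The model ID is assumed from the collection naming and may differ;
# see the Hugging Face collection linked above for exact repo names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Olmo-3-7B-Instruct"  # assumed ID, verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain what an open model flow is in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```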


More from @allen_ai

Nov 18
Today we’re releasing Deep Research Tulu (DR Tulu)—the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭📚
Complex questions rarely have a straightforward answer. A deep research agent has to decide what to look up, read across many sources, figure out what actually matters, & then explain it clearly with citations—not just spit out a summary.
DR Tulu’s key idea: Reinforcement Learning with Evolving Rubrics (RLER)—rubric rewards that:
◈ Use instance-specific, search-grounded criteria
◈ Are grounded in knowledge
◈ Evolve with training to capture strategies & failure modes, reducing reward hacking. 📈
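A loose sketch of the rubric-reward idea, not DR Tulu's actual implementation: each question carries its own criteria, a judge scores responses against them, and the rubric set grows as training surfaces new failure modes. The check functions below stand in for an LM judge.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Rubric:
    """One instance-specific criterion, e.g. 'cites at least two sources'."""
    description: str
    check: Callable[[str], bool]  # placeholder judge; DR Tulu uses an LM judge

@dataclass
class Instance:
    question: str
    rubrics: List[Rubric] = field(default_factory=list)

    def reward(self, response: str) -> float:
        """Fraction of rubric criteria the response satisfies."""
        if not self.rubrics:
            return 0.0
        return sum(r.check(response) for r in self.rubrics) / len(self.rubrics)

    def evolve(self, new_rubrics: List[Rubric]) -> None:
        """RLER's key move: update the rubric set during training to capture
        newly observed strategies and failure modes (reduces reward hacking)."""
        self.rubrics.extend(new_rubrics)

inst = Instance(
    question="What drove the 2021 chip shortage?",
    rubrics=[
        Rubric("cites at least two sources", lambda r: r.count("[") >= 2),
        Rubric("mentions demand shocks", lambda r: "demand" in r.lower()),
    ],
)
print(inst.reward("Demand surged [1] while fabs shut down [2]."))  # 1.0
```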
Nov 4
Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.
By applying AI to a planet’s worth of data, the OlmoEarth Platform is already empowering communities to act faster & with confidence to secure a sustainable future. 🌲
OlmoEarth delivers intelligence to people on the ground for anything from aiding restoration efforts to protecting natural resources and communities.
Under the hood is our industry-leading OlmoEarth foundation model family—AI that fuses 🛰️ satellite imagery, 📡 radar, ⛰️ elevation, & 🗺️ detailed map layers.
Open and fast to adapt + deploy, with industry-leading results on key benchmarks and in real-world applications for our partners.
Learn more → allenai.org/blog/olmoearth…
Aug 19
📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵
🧪 After analyzing 30 benchmarks & 465 open-weight models, the verdict is clear: a simple metric, signal-to-noise ratio (SNR), can reveal which benchmarks are actually informative for making decisions between two models.
📡 Signal = a benchmark’s ability to separate strong models from poor performers
📊 Noise = sensitivity to random variability between training steps
🔬 Benchmarks that can separate models and exhibit low noise during a model’s training are far more reliable for model eval
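A simplified sketch of the metric, assuming signal is the spread of final scores across models and noise is checkpoint-to-checkpoint variability within one run; see the paper for the exact definitions.

```python
import statistics

def snr(final_scores, checkpoint_scores):
    """Illustrative signal-to-noise ratio for a benchmark.

    final_scores: one score per model (signal = how spread out models are).
    checkpoint_scores: one model's scores over its last few training steps
    (noise = step-to-step variability).

    A simplified reading of the paper's metric, not the exact formula.
    """
    signal = max(final_scores) - min(final_scores)   # separation across models
    noise = statistics.stdev(checkpoint_scores)      # jitter within one run
    return signal / noise if noise > 0 else float("inf")

# Example: a benchmark that separates models widely and is stable across steps
print(snr([0.42, 0.55, 0.61, 0.70], [0.684, 0.702, 0.695, 0.699]))
```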
Apr 15
Ever wonder how LLM developers choose their pretraining data? It’s not guesswork—all AI labs run small-scale model experiments, but those models and their data are rarely shared.
DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
[Plot: compute used to predict a ranking of datasets vs. how accurately that ranking reflects performance at the target (1B) scale of models pretrained from scratch on those datasets.]
🔮DataDecide measures how accurately small experiments (1B parameters, 100B tokens, 3 seeds) predict the real ranking of large runs. This helps us make the most cost-effective decisions for our training runs. 💸
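A toy sketch of the evaluation DataDecide enables: score each pretraining dataset at small and target scale, then check how often the small-scale experiment picks the same winner. Pairwise decision accuracy below is an illustrative stand-in; see the paper for DataDecide's exact setup, and the dataset scores are made up.

```python
from itertools import combinations

def pairwise_decision_accuracy(small_scores, large_scores):
    """Fraction of dataset pairs where the small-scale experiment picks the
    same winner as the target-scale run. Inputs map dataset name -> score.
    Illustrative only; DataDecide's exact metric is in the paper."""
    pairs = list(combinations(small_scores, 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

small = {"dolma": 0.41, "c4": 0.38, "pile": 0.40}   # 1B-scale scores (made up)
large = {"dolma": 0.62, "c4": 0.55, "pile": 0.59}   # target-scale scores (made up)
print(pairwise_decision_accuracy(small, large))      # 1.0 -> rankings agree
```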
🎯 Picking the right benchmark to hillclimb on makes all the difference.
Some tasks are much more sensitive at small scale:
💡 MMLU & ARC Easy give strong signals earlier than HellaSwag
🚫 Others stay hard to predict—even at larger scales
[Plot: the compute needed to make good predictions varies between tasks; ARC and MMLU predict well with less than 2 orders of magnitude of the target compute.]
Apr 9
For years it’s been an open question — how much is a language model learning and synthesizing information, and how much is it just memorizing and reciting?

Introducing OLMoTrace, a new feature in the Ai2 Playground that begins to shed some light. 🔦
OLMoTrace connects phrases or even whole sentences in the language model’s output back to verbatim matches in its training data. It does this by searching billions of documents and trillions of tokens in real time and highlighting where it finds compelling matches.
[Screenshot: OLMoTrace in the Ai2 Playground. The prompt asks "Who is Celine Dion?"; highlighted spans in the model output are matched to training-data documents shown in the OLMoTrace panel on the right.]
OLMoTrace is useful for fact checking✅, understanding hallucinations🎃, tracing reasoning capabilities🧠, or just generally helping you see where an LLM’s response may have come from.
[Screenshot: a highlighted span reading "The Space Needle was built for the 1962 World's Fair" with an exact match in a pre-training document from olmo-mix-1124.]
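A toy sketch of the core idea, not Ai2's production system (which searches trillions of tokens with a purpose-built index): flag spans of model output that appear verbatim in a corpus.

```python
def verbatim_spans(output_tokens, corpus_ngrams, n=8):
    """Toy version of the OLMoTrace idea: find spans of model output that
    appear verbatim in training data. Here corpus_ngrams is just a set of
    n-grams from a tiny corpus, standing in for a real index."""
    hits = []
    for i in range(len(output_tokens) - n + 1):
        span = tuple(output_tokens[i:i + n])
        if span in corpus_ngrams:
            hits.append((i, i + n))
    return hits

corpus = "the space needle was built for the 1962 world's fair in seattle".split()
output = "as a landmark the space needle was built for the 1962 world's fair".split()
ngrams = {tuple(corpus[i:i + 8]) for i in range(len(corpus) - 7)}
print(verbatim_spans(output, ngrams))  # output spans found verbatim in corpus
```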
Mar 26
Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
[Screenshot: the Ai2 Paper Finder interface.]
Paper Finder breaks your query down into steps: searching for papers, following citations, evaluating relevance, and running follow-up queries based on the results. It then presents not only the papers, but also short summaries of why each paper is relevant to your specific query.
[Screenshot: part of Ai2 Paper Finder's reasoning process while it searches a query.]
Compared to other tools that focus on returning a few popular results, Paper Finder aims to cover the long tail of niche findings and hard-to-find papers that require an iterative search process. We believe this scope can better serve researchers who are experts in their fields. 🕵️
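A toy sketch of that iterative workflow; all four callables are hypothetical placeholders, not the Ai2 Paper Finder API.

```python
def paper_finder_loop(query, search, follow_citations, judge_relevance, rounds=3):
    """Toy sketch of the iterative workflow described above: search, follow
    citations, judge relevance, then issue follow-up queries based on what
    was found. All four callables are hypothetical placeholders; papers are
    assumed to be dicts with a "title" key."""
    found, frontier = [], [query]
    for _ in range(rounds):
        candidates = []
        for q in frontier:
            candidates.extend(search(q))                # keyword / dense search
        for paper in list(candidates):
            candidates.extend(follow_citations(paper))  # expand via citations
        relevant = [p for p in candidates if judge_relevance(query, p)]
        found.extend(relevant)
        # follow-up queries derived from what was just found
        frontier = [p["title"] for p in relevant][:5]
    return found
```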
