Kyle Lo
Nov 20
we released Olmo 3! lots of exciting stuff, but I wanna focus on:

🐟Olmo 3 32B Base, the best fully-open base model to date, near Qwen 2.5 & Gemma 3 on diverse evals

🐠Olmo 3 32B Think, first fully-open reasoning model approaching Qwen 3 levels

🐡12 training datasets corresponding to the different staged training recipes, all open & accessible

since I'm a pretraining person, I'll share some of my favorite Base model ideas:
🦈Invest in your experimental design!

"Fit scaling ladders" is the right advice, but there's more to understand about the suitability of different benchmarks for scaling experiments

We create evals better suited to different compute scales: our "easy" set of tasks & metrics supports very small-scale experiments, before we switch to our "main" set of evals, on which smaller models score below the noise floor
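One way to think about that noise floor: on a multiple-choice eval, a model's accuracy is only meaningful if it's statistically distinguishable from random guessing. Here's a minimal sketch of such a check (the function name, the z=2 cutoff, and the question counts are my own illustration, not the paper's actual criterion):

```python
import math

def above_noise_floor(accuracy, n_questions, n_choices, z=2.0):
    """Is a multiple-choice accuracy distinguishable from random guessing?
    The noise floor is chance accuracy plus z standard errors of the
    chance-level estimate over n_questions."""
    chance = 1.0 / n_choices
    stderr = math.sqrt(chance * (1.0 - chance) / n_questions)
    return accuracy > chance + z * stderr

# on a large eval (~14k 4-choice questions), 27% clears the 25% floor
print(above_noise_floor(0.27, 14042, 4))   # True
# on a small 200-question eval, the same score is within noise
print(above_noise_floor(0.27, 200, 4))     # False
```

This is why small-scale experiments want "easy" tasks: if your tiny models sit at chance on the hard benchmarks, the signal you're comparing is mostly noise.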
🍣Data mixing is a little too powerful

It's really easy to accidentally learn "optimal" mixes that heavily oversample certain pockets of data. For example, "STEM" docs are really valuable for climbing tasks like MMLU, so mixing methods can propose very high weights for them.

But realistically, you don't have infinite "STEM" docs. Upsampling is fine, but only to an extent; beyond that, you may actually benefit more from sampling new docs that aren't the "best" but are still "good"

We approach mixing as a token-constrained optimization problem & further penalize any solution that tanks any individual task too hard
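The token-constraint idea can be sketched very simply: cap each source's mixture weight at whatever its available tokens support (up to some max number of epochs), and redistribute the excess mass to unconstrained sources. This is my own simplified illustration, not the paper's actual optimizer (which also includes the per-task penalty), and the epoch cap and token counts below are made up:

```python
def token_constrained_mix(proposed, tokens_available, token_budget, max_epochs=4):
    """Clip mixture weights so no source is repeated more than max_epochs
    times within the token budget, redistributing excess mass to the
    remaining sources. Repeats until every cap is satisfied."""
    weights = dict(proposed)
    caps = {s: max_epochs * tokens_available[s] / token_budget for s in weights}
    pinned = set()
    for _ in range(len(weights)):
        over = {s for s, w in weights.items()
                if s not in pinned and w > caps[s] + 1e-12}
        if not over:
            break
        pinned |= over
        fixed_mass = sum(caps[s] for s in pinned)
        free_mass = sum(w for s, w in weights.items() if s not in pinned)
        scale = (1.0 - fixed_mass) / free_mass  # renormalize the free sources
        weights = {s: caps[s] if s in pinned else weights[s] * scale
                   for s in weights}
    return weights

# a mixer wants 60% "STEM", but only 50B STEM tokens exist for a 1T-token run:
mix = token_constrained_mix(
    proposed={"stem": 0.6, "web": 0.3, "code": 0.1},
    tokens_available={"stem": 50e9, "web": 800e9, "code": 100e9},
    token_budget=1e12,
)
print(mix)  # STEM capped at 4 epochs (weight 0.2); web & code absorb the rest
```

The point of the cap: an unconstrained mixer happily assigns weight that would mean epoching over a small source dozens of times, which stops helping long before that.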
🍨Data quality signals matter but also how you use them!

The traditional way to use data quality signals is to threshold: define a cutoff and take all the documents above it.

But why not sample *proportional* to data quality?

We use Quality-Aware Upsampling to do exactly this. Super simple idea, and it works really well!
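A minimal sketch of the sampling-proportional-to-quality idea (my own toy illustration; the exact weighting scheme in Olmo 3 may differ):

```python
import random

def quality_aware_upsample(docs, quality, k, seed=0):
    """Sample k documents with replacement, with probability proportional
    to each doc's quality score, instead of hard-thresholding."""
    rng = random.Random(seed)
    return rng.choices(docs, weights=[quality[d] for d in docs], k=k)

docs = ["high", "mid", "low"]
quality = {"high": 0.9, "mid": 0.5, "low": 0.1}
sample = quality_aware_upsample(docs, quality, k=10_000)
# the best docs dominate, but lower-quality docs still contribute
print(sample.count("high"), sample.count("mid"), sample.count("low"))
```

Compared to a hard cutoff, nothing "good but not great" gets thrown away outright; it just gets seen less often.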
🍕Finally, we all know midtraining is an exciting time to get a big performance boost

But the midtraining development loop kind of sucks. It's exhausting to hunt for problems in the model, create specific data that patches them, and retrain, over and over.

Team organization that sustains consistent model improvements (without burnout) is important. We had different teammates "own" target capabilities to reduce thrash, and a centralized assessment team defined the experimental methodology & eval standards.

Explorers would try things in a distributed manner and propose the best ideas back (with evidence). The team defines and conducts regular "integration tests" where we combine ideas and see if they work together. If so, that's the new baseline to beat & everyone goes exploring again!
🍏Try the model: playground.allenai.org
🍍Download the collection: huggingface.co/collections/al…
🍌Read the blog: allenai.org/blog/olmo3
🍎And our 100+ page paper lol 🤪: datocms-assets.com/64837/17636468…

more linked from the Ai2 post x.com/allen_ai/statu…
We're hiring too!

Olmo 3 was our biggest effort yet, but we're still a small team (67 authors!) compared to a lot of the big labs, which means everyone gets to own a major piece of the Olmo puzzle.

In particular, we're hiring student interns, who have been critical to our project's success. So if you're interested in doing research with us, contributing to open science, and getting enthusiastic support to publish your work, we hope to see you apply:

job-boards.greenhouse.io/thealleninstit…
