Noam Brown
Jul 21 · 7 tweets
Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different from ours, but I think that shows there are many research directions for further progress. Some thoughts on our model and results 🧵
~2 months ago, the IMO emailed us about participating in a formal (Lean) version of the IMO. We’ve been focused on general reasoning in natural language without the constraints of Lean, so we declined. We were never approached about a natural language math option.
Over the past several months, we made a lot of progress on general reasoning. This involved collecting, curating, and training on high-quality math data, which will also go into future models. In our IMO eval we did not use RAG or any tools.
We had each submitted proof graded by 3 external IMO medalists, and there was unanimous consensus on correctness. We have also posted the proofs publicly so that anyone can verify correctness. github.com/aw31/openai-im… x.com/alexwei_/statu…
Before we shared our results, we spoke with an IMO board member, who asked us to wait until after the award ceremony to make it public, a request we happily honored.
We announced at ~1am PT (6pm AEST), after the award ceremony concluded. At no point did anyone request that we announce later than that.
More than anything, we’re excited to share our progress and results with the world. AI reasoning capabilities are progressing fast, and these IMO results really show it.

More from @polynoamial

Jul 19
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵
Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where each answer is simply an integer from 0 to 999.
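To make the verifiability contrast concrete, here is a minimal sketch (my illustration, not OpenAI's code) of why AIME-style answers can be graded automatically while IMO proofs cannot:

```python
# Minimal illustration (not OpenAI's code): an AIME answer is a single
# integer in [0, 999], so grading is a one-line exact-match check.
def grade_aime(model_answer: str, ground_truth: int) -> bool:
    """Return True iff the model's final answer matches the known integer."""
    try:
        value = int(model_answer.strip())
    except ValueError:
        return False
    return 0 <= value <= 999 and value == ground_truth

assert grade_aime("204", 204)
assert not grade_aime("probably 204", 204)

# There is no analogous function for an IMO proof: correctness is a human
# judgment over pages of argument, which is what makes the reward hard to verify.
```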
Sep 12, 2024
Today, I’m excited to share with you all the fruit of our effort at @OpenAI to create AI models capable of truly general reasoning: OpenAI's new o1 model series! (aka 🍓) Let me explain 🧵 1/
Our o1-preview and o1-mini models are available immediately. We’re also sharing evals for our (still unfinalized) o1 model to show the world that this isn’t a one-off improvement – it’s a new scaling paradigm and we’re just getting started. 2/9
o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.
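The thread does not reveal how o1 spends its thinking time, so the following is only a hedged illustration of the general idea of scaling inference compute, using the simplest well-known technique: sample several independent answers and take a majority vote (self-consistency). `sample_answer` is a hypothetical stand-in for one model call.

```python
import random
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[], str], k: int) -> str:
    """Spend k model calls (k times the inference compute) on one question."""
    votes = Counter(sample_answer() for _ in range(k))
    return votes.most_common(1)[0][0]

# Toy demo: a noisy "model" that answers correctly only 60% of the time.
noisy_model = lambda: random.choice(["42"] * 6 + ["41", "43", "7", "0"])
print(majority_vote(noisy_model, k=1))   # right about 60% of the time
print(majority_vote(noisy_model, k=51))  # almost always "42"
```

The only point here is the scaling axis: more samples (or, for o1, a longer private chain of thought) means more inference compute and better accuracy on reasoning tasks.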
Jul 6, 2023
I’m thrilled to share that I've joined @OpenAI! 🚀 For years I’ve researched AI self-play and reasoning in games like Poker and Diplomacy. I’ll now investigate how to make these methods truly general. If successful, we may one day see LLMs that are 1,000x better than GPT-4 🌌 1/
In 2016, AlphaGo beat Lee Sedol in a milestone for AI. But key to that was the AI's ability to "ponder" for ~1 minute before each move. How much did that improve it? For AlphaGoZero, it's the equivalent of scaling pretraining by ~100,000x (~5200 Elo with search, ~3000 without) 2/
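For intuition about that gap, the standard Elo expected-score formula (my addition, plugging in the ~5200 and ~3000 figures quoted above) converts the rating difference into a win probability:

\[
P(\text{win}) = \frac{1}{1 + 10^{-\Delta/400}},
\qquad
\Delta = 5200 - 3000 = 2200
\;\Rightarrow\;
P = \frac{1}{1 + 10^{-5.5}} \approx 0.999997 .
\]

At that gap, the search-enabled agent essentially never loses to the search-free one.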
Also in 2016, I observed a similar phenomenon in poker. That insight led to our Libratus poker AI that beat top humans for the first time. @andy_l_jones investigated the train-time/test-time compute tradeoff in detail in Hex and found a similar pattern: 3/
Nov 22, 2022
3 years ago my teammates and I set out toward a goal that seemed like science fiction: to build an AI that could strategically outnegotiate humans *in natural language* in Diplomacy. Today, I’m excited to share our Science paper showing we’ve succeeded! 🧵
2/ Diplomacy is a 7-player game best described as a mix of Risk, poker, and Survivor. It was JFK’s favorite game. @demishassabis is a former champion in it. And it’s been a decades-old, seemingly impossible grand challenge for AI. Why?
3/ Diplomacy is about building trust in an environment that encourages players to not trust anyone. All players act simultaneously after non-binding, private negotiations. To succeed, you must account for the risk that players might lie, and that players might doubt your honesty.
Oct 12, 2022
Building on years of work from MILA, DeepMind, ourselves, and others, our AIs are now expert-human-level in no-press Diplomacy and Hanabi! Unlike Go and Dota, Diplomacy/Hanabi involve *cooperation*, which breaks naive RL. arxiv.org/abs/2210.05492 arxiv.org/abs/2210.05125 🧵👇
In two-player zero-sum games like Go/Poker/Dota, principled self-play RL converges to a perfect strategy. A scalable algorithm with enough capacity/compute is all you need. But self-play RL alone may not play well with humans in *cooperative* games, even with *infinite* compute.
That's because cooperative games may have many incompatible conventions, and to play well with humans an AI needs to understand these human conventions. Language is one example. Self-play RL w/o human data won't converge to using English in a cooperative game with communication.
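As a toy sketch of the first claim above (mine, not the papers' algorithm): regret-matching self-play in rock-paper-scissors, a two-player zero-sum game, where the average strategy provably converges to the equilibrium, uniform (1/3, 1/3, 1/3).

```python
import random

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors
PAYOFF = [[ 0, -1,  1],
          [ 1,  0, -1],
          [-1,  1,  0]]  # PAYOFF[mine][theirs], from my point of view

def current_strategy(regrets):
    # Regret matching: play in proportion to positive cumulative regret.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / ACTIONS] * ACTIONS

def self_play(iterations=200_000):
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sum = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strategies = [current_strategy(r) for r in regrets]
        actions = [random.choices(range(ACTIONS), weights=s)[0] for s in strategies]
        for p in (0, 1):
            opp = actions[1 - p]
            earned = PAYOFF[actions[p]][opp]
            for a in range(ACTIONS):
                # Regret: what playing a would have earned minus what we earned.
                regrets[p][a] += PAYOFF[a][opp] - earned
                strategy_sum[p][a] += strategies[p][a]
    # The *average* strategy converges to Nash; individual iterates may cycle.
    return [[s / iterations for s in strategy_sum[p]] for p in (0, 1)]

print(self_play())  # both players approach [0.333, 0.333, 0.333]
```

No such guarantee exists once cooperation enters, which is exactly the convention problem described above.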
