Noam Brown Profile picture
Oct 2 4 tweets 2 min read Read on X
My new hobby is asking GPT-5 Thinking to find errors in every @Wikipedia page I read. Interestingly, almost every page I checked has at least one error. 🧵 Image
Image
Some of these errors can be quite serious. Image
Image
Are these just hallucinations by GPT-5 Thinking? Nope. I checked the sources and as far as I can tell they are all real errors.Image
Image
Even the Wikipedia page on Wikipedia has an error. Image
Image

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Noam Brown

Noam Brown Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @polynoamial

Aug 11
In my opinion, the most important takeaway from this result is that our @OpenAI International Math Olympiad (IMO) gold model is also our best competitive coding model. 🧵
After the IMO, we ran full evals on the IMO gold model and found that aside from just competitive math, it was also our best model in many other areas, including coding. So folks decided to take the same exact IMO gold model, without any changes, and use it in the system for IOI.
The IOI scaffold involved sampling from a few different models and then using another model and a heuristic to select solutions for submission. This system achieved a gold medal, placing 6th among humans. The IMO gold model indeed did best out of all the models we sampled from.
Read 5 tweets
Jul 21
Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different than ours, but I think that shows there are many research directions for further progress. Some thoughts on our model and results 🧵
~2 months ago, the IMO emailed us about participating in a formal (Lean) version of the IMO. We’ve been focused on general reasoning in natural language without the constraints of Lean, so we declined. We were never approached about a natural language math option.
Over the past several months, we made a lot of progress on general reasoning. This involved collecting, curating, and training on high-quality math data, which will also go into future models. In our IMO eval we did not use RAG or any tools.
Read 7 tweets
Jul 19
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵
Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.
So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where answers are simply an integer from 0 to 999.
Read 8 tweets
Sep 12, 2024
Today, I’m excited to share with you all the fruit of our effort at @OpenAI to create AI models capable of truly general reasoning: OpenAI's new o1 model series! (aka 🍓) Let me explain 🧵 1/Image
@OpenAI Our o1-preview and o1-mini models are available immediately. We’re also sharing evals for our (still unfinalized) o1 model to show the world that this isn’t a one-off improvement – it’s a new scaling paradigm and we’re just getting started. 2/9 Image
@OpenAI o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too. Image
Read 10 tweets
Jul 6, 2023
I’m thrilled to share that I've joined @OpenAI! 🚀 For years I’ve researched AI self-play and reasoning in games like Poker and Diplomacy. I’ll now investigate how to make these methods truly general. If successful, we may one day see LLMs that are 1,000x better than GPT-4 🌌 1/
In 2016, AlphaGo beat Lee Sedol in a milestone for AI. But key to that was the AI's ability to "ponder" for ~1 minute before each move. How much did that improve it? For AlphaGoZero, it's the equivalent of scaling pretraining by ~100,000x (~5200 Elo with search, ~3000 without) 2/ Image
Also in 2016, I observed a similar phenomenon in poker. That insight led to our Libratus poker AI that beat top humans for the first time. @andy_l_jones investigated the train-time/test-time compute tradeoff in detail in Hex and found a similar pattern: 3/
Read 6 tweets
Nov 22, 2022
3 years ago my teammates and I set out toward a goal that seemed like science fiction: to build an AI that could strategically outnegotiate humans *in natural language* in Diplomacy. Today, I’m excited to share our Science paper showing we’ve succeeded! 🧵
2/ Diplomacy is a 7-player game best described as a mix of Risk, poker, and Survivor. It was JFK’s favorite game. @demishassabis is a former champion in it. And it’s been a decades-old, seemingly impossible grand challenge for AI. Why?
3/ Diplomacy is about building trust in an environment that encourages players to not trust anyone. All players act simultaneously after non-binding, private negotiations. To succeed, you must account for the risk that players might lie, and that players might doubt your honesty.
Read 14 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(