Tamay Besiroglu
Dec 21 · 12 tweets · 3 min read
1/11 I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.
2/11 For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.
3/11 With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3's 25.2% at Pass@1 is substantially more impressive.
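The Pass@1/Pass@8 distinction can be made concrete with the standard unbiased pass@k estimator from Chen et al. (2021). A minimal sketch — the sample counts below are made up for illustration, not Epoch's evaluation data:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples drawn from n total (c correct)
    solves the problem."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: a problem solved on 1 of 8 samples.
print(pass_at_k(8, 1, 1))  # 0.125  -> counted ~12.5% toward pass@1
print(pass_at_k(8, 1, 8))  # 1.0    -> counted fully toward pass@8
```

Averaging this estimator over all problems yields the benchmark-level pass@k figure, which is why pass@8 can sit far above pass@1 for the same model.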
4/11 It’s important to note that while the average problem difficulty is extremely high, FrontierMath problems vary in difficulty.

Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems).
5/11 Even Tier 1 and 2 problems can take hours for experts to solve. Tier 3 problems are what Tao and Gowers called "exceptionally hard," often requiring days of effort from top mathematicians.
6/11 Because of this range in difficulty, it’s too soon to say o3 excels at the hardest research-level tasks. It’s likely solving some Tier 3 problems, but I suspect its average performance on them remains fairly low.
7/11 How was the 25.2% achieved? At the release, OpenAI said only that they used “test-time settings”.

For o1, which reported results with similar solid/shaded bars, the shaded bar corresponded to majority vote (consensus) over 64 samples.
8/11 For ARC-AGI, OpenAI scaled inference cost from tens of dollars to a few thousand dollars per task.
If they applied similar scaling to FrontierMath, this would represent inference scaling of 2–3 OOMs above baseline.
9/11 This is notable because our earlier tests showed only a few percentage points of performance gained per OOM of inference compute. o3's jump to 25% suggests both improved per-token reasoning and better scaling behavior.
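A back-of-envelope version of the argument in tweets 8–9 — the dollar figures and the points-per-OOM rate below are illustrative assumptions, not published numbers:

```python
import math

def ooms(base_cost: float, scaled_cost: float) -> float:
    """Orders of magnitude of per-task inference-cost scaling,
    used here as a rough proxy for inference-compute scaling."""
    return math.log10(scaled_cost / base_cost)

scaling = ooms(20.0, 3000.0)   # hypothetical: ~$20/task -> ~$3,000/task
gain_per_oom = 3.0             # "a few percentage points per OOM"
expected_gain = scaling * gain_per_oom

print(f"{scaling:.1f} OOMs -> ~{expected_gain:.0f} pp expected gain")
# Under these assumptions, pure inference scaling predicts a single-digit
# gain over the ~2% baseline — far short of the observed 25.2%.
```

The gap between that naive extrapolation and the observed score is what suggests improvements beyond inference scaling alone.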
10/11 I previously predicted 25% performance by Dec 31, 2025 (my median forecast, with an 80% CI of 14–60%). o3 has reached it earlier than I'd have expected on average.
11/11 Still, 25% means it’s not close to “solving” FrontierMath (e.g. >80% performance). Yet, I find o3’s performance genuinely impressive.

While it's not yet conquering the whole benchmark, I don't expect that to take more than a year or two.
12/11 We’ll certainly need even stronger benchmarks going forward. If our hardest research problems are “Tier 3,” maybe it’s time for a “Tier 4”?

More from @tamaybes

Dec 21
I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.
FrontierMath currently spans three broad tiers:
• T1 (25%) Advanced, near top-tier undergrad/IMO
• T2 (50%) Needs serious grad-level background
• T3 (25%) Research problems demanding relevant research experience
All can take hours—or days—for experts to solve.

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.
May 16
A few weeks ago, we attempted to replicate the Chinchilla paper. We found that their estimated model fails to adequately fit the reconstructed data, that it implies inconsistent scaling policies, and that their confidence intervals are implausibly narrow.
The authors responded, clarifying that this was the result of their optimizer stopping early due to a bad loss scale choice. They plan to update their results and release the data. We appreciate @borgeaud_s and others' openness in addressing this issue.
This error is understandable. In my experience, choosing the right optimizer and loss scale is often non-trivial, with no obvious error signs in case of poor convergence. I know of at least one other otherwise excellent paper that had a very similar issue.
Apr 17
The Chinchilla scaling paper by Hoffmann et al. has been highly influential in the language modeling community. We tried to replicate a key part of their work and discovered discrepancies. Here's what we found. (1/9)
We reconstructed the data by extracting the SVG from the paper, parsing out the point locations & colors, mapping the coordinates to model size & FLOP, and mapping the colors to loss values. This let us closely approximate their original dataset from just the figure. (2/9)
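The reconstruction pipeline can be sketched in miniature. The SVG structure and axis calibration below are entirely hypothetical stand-ins for the paper's actual figure:

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical two-point SVG; a real paper figure is far more complex.
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <circle cx="10" cy="90" fill="#1f77b4"/>
  <circle cx="55" cy="40" fill="#ff7f0e"/>
</svg>"""
ns = {"svg": "http://www.w3.org/2000/svg"}
root = ET.fromstring(svg)

def x_to_flop(cx, x0=0.0, x1=100.0, f0=1e18, f1=1e24):
    """Map a pixel x-coordinate to FLOP on a log axis
    (axis endpoints are assumed calibration values)."""
    t = (cx - x0) / (x1 - x0)
    return 10 ** (math.log10(f0) + t * (math.log10(f1) - math.log10(f0)))

for c in root.findall("svg:circle", ns):
    cx, color = float(c.get("cx")), c.get("fill")
    print(f"{color}: {x_to_flop(cx):.2e} FLOP")
```

The same coordinate-mapping idea extends to the y-axis for model size, with point colors keyed back to a loss colorbar.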
When we fit their parametric scaling law, we get strikingly different estimates (Chi-squared p-value <1e-60!). The differences are significant for the data-scaling coefficient β and the irreducible loss E. (3/9)
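A minimal sketch of fitting the Chinchilla parametric form L(N, D) = E + A/N^α + B/D^β, here on clean synthetic data rather than the reconstructed points (the actual replication, like Hoffmann et al., fits a Huber loss on log-loss, not plain least squares):

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, B, alpha, beta):
    """Hoffmann et al.'s parametric form: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic grid of model sizes N and token counts D (NOT the reconstructed
# Chinchilla data); true_params are the paper's reported central estimates.
N, D = np.meshgrid(np.logspace(7, 10, 20), np.logspace(9, 12, 20))
N, D = N.ravel(), D.ravel()
true_params = (1.69, 406.4, 410.7, 0.34, 0.28)
L = chinchilla_loss((N, D), *true_params)

popt, _ = curve_fit(chinchilla_loss, (N, D), L,
                    p0=[1.5, 300.0, 300.0, 0.3, 0.3], maxfev=50000)
print(np.round(popt, 2))  # should land close to true_params on clean data
```

On noiseless data this recovers the parameters; on noisy reconstructed points, optimizer and loss-scale choices matter, which is exactly where the original fit went wrong.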
Mar 12
Language models have come a long way since 2012, when recurrent networks struggled to form coherent sentences. Our new paper finds that the compute needed to achieve a set performance level has been halving every 5 to 14 months on average. (1/10)
This rate of algorithmic progress is much faster than the two-year doubling time of Moore's Law for hardware improvements, and faster than other domains of software, like SAT-solvers, linear programs, etc. (2/10)
We estimate this using a dataset of over 200 language models from 2012 to 2023, evaluated on WikiText and Penn Treebank. By fitting a modified neural scaling law to this data, we estimate the rate of algorithmic efficiency improvements over time. (3/10)
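The halving-time comparison above translates into annual efficiency factors with simple exponent arithmetic:

```python
def annual_factor(doubling_months: float) -> float:
    """Growth factor per year implied by a given doubling time
    (equivalently, a halving time for required compute)."""
    return 2.0 ** (12.0 / doubling_months)

# 5 and 14 months bound the paper's estimate; 24 months is Moore's-law
# hardware doubling for reference.
for months in (5, 8, 14, 24):
    print(f"{months:>2} mo doubling -> {annual_factor(months):.1f}x per year")
```

Even the slow end of the estimated range (14 months, ~1.8x/year) outpaces the ~1.4x/year implied by a two-year Moore's-law doubling.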
Dec 13, 2022
How much progress in machine learning has been due to advances in algorithms (architectures, optimisers, activation functions, etc.), and how much has been due to the scaling of compute or datasets?
@EgeErdil2 and I provide new answers: arxiv.org/abs/2212.05153
We use a dataset of over a hundred computer vision models from the last decade to investigate how better algorithms and architectures have enabled researchers to use compute and data more efficiently.
We find that every 9 months, the introduction of better algorithms contributes the equivalent of a doubling of compute budgets. This is much faster than the gains from Moore’s law! That said, there's uncertainty (our 95% CI spans 4 to 25 months).
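Compounding the 9-month doubling (and the CI endpoints) over a longer horizon shows how wide the implied range is:

```python
def compute_equivalent_gain(years: float, doubling_months: float) -> float:
    """Total effective compute-budget multiplier from algorithmic
    improvements, given a doubling time in months."""
    return 2.0 ** (12.0 * years / doubling_months)

# Central estimate: doubling every 9 months; 95% CI endpoints 4 and 25.
for months in (4, 9, 25):
    gain = compute_equivalent_gain(10, months)
    print(f"{months:>2} mo doubling -> {gain:,.0f}x over a decade")
```

The decade-scale multiplier spans roughly four orders of magnitude across the confidence interval, which is why the doubling-time uncertainty matters so much.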
Jun 20, 2022
I recently organized a contest for @Metaculus on investigations into predictions of the future of AI. This resulted in two-dozen insightful analyses by forecasters into the prospects of transformatively advanced AI systems. Here are my short summaries of some that stood out:
This piece by @EgeErdil2 uses a hyperbolic growth model to argue that an economy could be transformed fairly quickly following the widespread deployment of advanced AI
metaculus.com/notebooks/1061…
He finds that a basic model implies that it'd take ~3 months to go from widespread deployment of AI to a radical transformation (with some uncertainty, but not much). At best, we may see transformative AI coming a year or two in advance.
