Daniel Litt
Mar 8 · 23 tweets
In this thread I want to share some thoughts about the FrontierMath benchmark, on which, according to OpenAI, some frontier models are scoring ~20%. It's a benchmark consisting of difficult math problems with numerical answers. What does it measure, and what doesn't it measure?
I'll try to organize my thoughts around a problem I wrote for the benchmark. To be clear, I don't intend any of this as criticism of the benchmark or of Epoch AI, which I think is doing fantastic work. Please read anything that sounds like criticism as aimed at the problem's author, namely me.
(FWIW the problem is not exactly as I wrote it--I asked for the precise value of the limit L, not its first 3 digits. The answer is the sum of the reciprocals of the first 3264 positive integers. That said I think the edit made by Epoch did not alter the question substantially.)
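(If you want to see the number: here's a short Python check--my own illustrative snippet, not part of the benchmark or the official solution--computing the limit L as the 3264th harmonic number, exactly and then as a decimal.)

```python
from fractions import Fraction

# L = H_3264 = 1 + 1/2 + ... + 1/3264, computed as an exact rational
L = sum(Fraction(1, k) for k in range(1, 3265))
print(float(L))  # ≈ 8.668, so the first three digits are 8, 6, 6
```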
When I wrote this problem I was convinced it was far out of reach of existing models. It's not clear to me whether it was solved during the internal OpenAI evaluations, but o3-mini-high solves it fairly regularly. When o3-mini-high came out and I discovered this I was shocked.
Here is an example from today of o3-mini-high getting the right answer:

chatgpt.com/share/67cc9a1c…

FWIW it took me 3 tries and some prompt manipulation to get this to work, but I find it exceedingly impressive.
The argument is more or less what I outline in the official solution, which you can see here:

epoch.ai/frontiermath/b…

It has 4 steps, which I outline below.
(1) Over an algebraically closed field, the number of conics tangent to five general conics is 3264. This was discovered by a number of people in the 1850s and 1860s, correcting an earlier incorrect claim by Steiner.
(2) The "Galois group" of the relevant enumerative problem is S_3264, the full symmetric group on 3264 letters. This was proven by Harris in 1979.
(3) The number of components we are interested in is the number of cycles of Frobenius acting on this 3264-element set. As p gets large Frobenius is equidistributed in the symmetric group, by Chebotarev density.
(4) The expected number of cycles of a random permutation in S_n is the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n. This is a classical combinatorics fact.

Putting these facts together, we win.
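To make steps (3) and (4) concrete, here's a small Monte Carlo sketch--my own illustration, not part of the official solution--under the simplifying assumption that Frobenius behaves like a uniformly random element of S_3264, which is what the equidistribution in step (3) buys you as p grows. The average cycle count should land near H_3264 ≈ 8.668:

```python
import random

def num_cycles(perm):
    # Count the cycles of a permutation of {0, ..., n-1}, where perm[i] is the image of i.
    seen = [False] * len(perm)
    cycles = 0
    for i in range(len(perm)):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

n, trials = 3264, 1000  # 3264 conics; the trial count is an arbitrary choice
avg = sum(num_cycles(random.sample(range(n), n)) for _ in range(trials)) / trials
H_n = sum(1.0 / k for k in range(1, n + 1))
print(f"average cycles over {trials} random permutations: {avg:.3f}")
print(f"H_{n} = {H_n:.3f}")  # both should be ~8.668, give or take ~0.1
```

Of course, the simulation only checks the purely combinatorial step (4); the real content of the problem is in steps (1)-(3).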
This is more or less what o3-mini-high seems to be doing when it successfully answers the question. To be clear, this is extremely impressive in my view.
What makes this problem hard? I think this is where I made two mistakes when I was writing it. The first thing that makes the problem hard is that it requires a lot of background--you have to know the facts 1-4 above. Most mathematicians don't know these facts.
Indeed, even understanding some of the statements--like the relevant form of the Chebotarev density theorem, or what the Galois group of an enumerative problem is--requires a fair amount of background.
The second thing is that proving these statements is hard. And I hadn't internalized that to answer the question YOU DON'T NEED TO PROVE THESE STATEMENTS.
Knowing obscure facts is hard for a person but much easier for an LLM. And if you ask the LLM to prove any of 1-4 above, it will typically whiff.
So what is the benchmark measuring? I think it's something like the following: (a) how much known (if possibly obscure) mathematical knowledge does the LLM have, and (b) can it match a problem to the knowledge it has memorized.
Thinking about this over the past couple of months, I've come to the conclusion that a fair amount of math research actually does have this flavor--one looks up or recalls some known facts and puts them together. This is the 90% of math research that is "routine."
What this suggests to me is that these reasoning models are not too far from being very useful aids in this part of doing math. I expect them to be regularly useful at this kind of thing by the end of the year.
What about the non-routine part of math research--coming up with genuinely new ideas or techniques, understanding previously poorly-understood structures, etc.? First, I think it's worth saying that this is (i) the important part of research, and (ii) it happens pretty rarely.
I'm not necessarily skeptical that an AI tool can do this. If this incident has revealed anything to me it is that I don't necessarily fully understand what skills one needs to do non-routine mathematics work.
My sense is that the skills used to do this are quite different from what the AI is using to solve FrontierMath problems--that is, I don't think the primary skill being used is recall. It is, I think, something more like philosophy, and it's less clear how to train for it.
I've been thinking hard about how one would make a benchmark that is more about what I think of as the fundamental skills for math research. Please let me know if you have any thoughts on this.
