Daniel Litt
Mar 8 · 23 tweets
In this thread I want to share some thoughts about the FrontierMath benchmark, on which, according to OpenAI, some frontier models are scoring ~20%. It's a benchmark consisting of difficult math problems with numerical answers. What does it measure, and what doesn't it measure?
I'll try to organize my thoughts around a problem I wrote for the benchmark. To be clear, I don't intend any of this as criticism of the benchmark or of Epoch AI, which I think is doing fantastic work. Please read anything that sounds like criticism as aimed at the problem's author, namely me.
(FWIW the problem is not exactly as I wrote it--I asked for the precise value of the limit L, not its first 3 digits. The answer is the sum of the reciprocals of the first 3264 positive integers. That said I think the edit made by Epoch did not alter the question substantially.)
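(If you want to see the number: here's a short Python check--my own illustrative snippet, not part of the benchmark or the official solution--computing the limit L as the 3264th harmonic number, exactly and then as a decimal.)

```python
from fractions import Fraction

# L = H_3264 = 1 + 1/2 + ... + 1/3264, computed as an exact rational
L = sum(Fraction(1, k) for k in range(1, 3265))
print(float(L))  # ≈ 8.668, so the first three digits are 8, 6, 6
```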
When I wrote this problem I was convinced it was far out of reach of existing models. It's not clear to me whether it was solved during the internal OpenAI evaluations, but o3-mini-high solves it fairly regularly. When o3-mini-high came out and I discovered this I was shocked.
Here is an example from today of o3-mini-high getting the right answer:

chatgpt.com/share/67cc9a1c…

FWIW it took me 3 tries and some prompt manipulation to get this to work, but I find it exceedingly impressive.
The argument is more or less what I outline in the official solution, which you can see here:

epoch.ai/frontiermath/b…

It has 4 steps, which I outline below.
(1) Over an algebraically closed field, the number of conics tangent to five general conics is 3264. This was discovered by a number of people in the 1850s and 1860s, correcting an earlier incorrect claim by Steiner.
(2) The "Galois group" of the relevant enumerative problem is S_3264, the full symmetric group on 3264 letters. This was proven by Harris in 1979.
(3) The number of components we are interested in is the number of cycles of Frobenius acting on this 3264-element set. As p gets large Frobenius is equidistributed in the symmetric group, by Chebotarev density.
(4) The expected number of cycles of a random permutation in S_n is the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n. This is a classical combinatorics fact.

Putting these facts together, we win.
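To make steps (3) and (4) concrete, here's a small Monte Carlo sketch--my own illustration, not part of the official solution--under the simplifying assumption that Frobenius behaves like a uniformly random element of S_3264, which is what the equidistribution in step (3) buys you as p grows. The average cycle count should land near H_3264 ≈ 8.668:

```python
import random

def num_cycles(perm):
    # Count the cycles of a permutation of {0, ..., n-1}, where perm[i] is the image of i.
    seen = [False] * len(perm)
    cycles = 0
    for i in range(len(perm)):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

n, trials = 3264, 1000  # 3264 conics; the trial count is an arbitrary choice
avg = sum(num_cycles(random.sample(range(n), n)) for _ in range(trials)) / trials
H_n = sum(1.0 / k for k in range(1, n + 1))
print(f"average cycles over {trials} random permutations: {avg:.3f}")
print(f"H_{n} = {H_n:.3f}")  # both should be ~8.668, give or take ~0.1
```

Of course, the simulation only checks the purely combinatorial step (4); the real content of the problem is in steps (1)-(3).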
This is more or less what o3-mini-high seems to be doing when it successfully answers the question. To be clear, this is extremely impressive in my view.
What makes this problem hard? I think this is where I made two mistakes when I was writing it. The first thing that makes the problem hard is that it requires a lot of background--you have to know the facts 1-4 above. Most mathematicians don't know these facts.
Indeed, even understanding some of the statements--like the relevant form of the Chebotarev density theorem, or what the Galois group of an enumerative problem is--requires a fair amount of background.
The second thing is that proving these statements is hard. And I hadn't internalized that to answer the question YOU DON'T NEED TO PROVE THESE STATEMENTS.
Knowing obscure facts is hard for a person but much easier for an LLM. And if you ask the LLM to prove any of 1-4 above, it will typically whiff.
So what is the benchmark measuring? I think it's something like the following: (a) how much known (if possibly obscure) mathematical knowledge does the LLM have, and (b) can it match a problem to the knowledge it has memorized.
Thinking about this over the past couple of months, I've come to the conclusion that a fair amount of math research actually does have this flavor--one looks up or recalls some known facts and puts them together. This is the 90% of math research that is "routine."
What this suggests to me is that these reasoning models are not too far from being very useful aids in this part of doing math. I expect them to be regularly useful at this kind of thing by the end of the year.
What about the non-routine part of math research--coming up with genuinely new ideas or techniques, understanding previously poorly-understood structures, etc.? First, I think it's worth saying that this is (i) the important part of research, and (ii) it happens pretty rarely.
I'm not necessarily skeptical that an AI tool can do this. If this incident has revealed anything to me it is that I don't necessarily fully understand what skills one needs to do non-routine mathematics work.
My sense is that the skills used to do this are quite different from what the AI is using to solve FrontierMath problems--that is, I don't think the primary skill being used is recall. It is, I think, something more like philosophy, and it's less clear how to train for it.
I've been thinking hard about how one would make a benchmark that is more about what I think of as the fundamental skills for math research. Please let me know if you have any thoughts on this.
