One piece of info that seems important to me in terms of forecasting usefulness of new AI models for mathematics: did the gold-medal-winning models, which did not solve IMO problem 6, submit incorrect answers for it? 🧵
Why this is important: it is very hard/time-intensive to check correctness of English-language mathematical proofs. (See below for a thread on this topic.)
A tool that can solve hard math problems but also produces incorrect answers to problems it can't solve could end up being a net productivity sink--or worse, if it fools the user, could lead to incorrect claims.
One worry I have is that AI tools will develop their ability to produce hard-to-check, convincing mathematical prose more rapidly than their ability to produce correct answers. If so I suspect we will soon see a deluge of persuasive-looking, incorrect results.
I've recently noticed an uptick in crank papers in my area submitted to arXiv -- 7 out of 9 papers with "Hodge conjecture" in the title or abstract since June 15 have been nonsense. I think almost all of these were written with LLM aid.
Currently it's quite easy for me to check that these papers are nonsense (a matter of a few minutes), but as capabilities improve, this will become more difficult. It's quite possible we will soon no longer be able to trust a randomly-chosen math paper on arXiv.
Of course (auto)formalization will help to address this--but in my area any sort of formalization is, I think, pretty far away.
If these models did not submit a wrong answer to P6, I think that's a pretty bullish sign for the near-term future of AI-for-math. If they did... /fin
An AI tool that gets gold on the IMO is obviously immensely impressive. Does it mean math is “solved”? Is an AI-generated proof of the Riemann hypothesis clearly on the horizon? Obviously not.
Worth keeping timescales in mind here: IMO competitors spend an average of 1.5 hrs on each problem. High-quality math research, by contrast, takes months or years.
What are the obstructions to AI performing high-quality autonomous math research? I don’t claim to know for sure, but I think they include many of the same obstructions that prevent it from doing many jobs:
In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks.
Before I start let me say a bit about how I think about this. I try to think about how useful a product is for math research in terms of its marginal value over Googling (MVoG). This is sort of a complicated metric.
A lot of undergraduate-level mathematical statements are Google-able; a model's ability to do homework problems well doesn't dramatically increase its usefulness for research. And results/proofs I find by googling are typically fairly reliable.
In this thread I want to share some thoughts about the FrontierMath benchmark, on which, according to OpenAI, some frontier models are scoring ~20%. This is a benchmark consisting of difficult math problems with numerical answers. What does it measure, and what doesn't it measure?
I'll try to organize my thoughts around this problem. To be clear, I don't intend any of this as criticism of the benchmark or of Epoch AI, which I think is doing fantastic work. Please understand anything that reads as criticism as aimed at the problem's author, namely me.
(FWIW the problem is not exactly as I wrote it--I asked for the precise value of the limit L, not its first 3 digits. The answer is the sum of the reciprocals of the first 3264 positive integers. That said I think the edit made by Epoch did not alter the question substantially.)
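For concreteness, here's a minimal sketch (in Python, my choice of tooling; nothing like this appears in the thread) of the computation behind the edited version of the problem, assuming the answer is the harmonic sum described above:

```python
from fractions import Fraction

# Sum of the reciprocals of the first 3264 positive integers (the harmonic
# number H_3264), in exact rational arithmetic so the leading digits are not
# affected by floating-point rounding.
L = sum(Fraction(1, k) for k in range(1, 3265))

# Epoch's edited version asks only for the first 3 digits of this value.
print(float(L))
```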
Had a sort of a funny experience with OpenAI’s Deep Research tool, which I wanted to share since I think it reveals some of the tool’s strengths and weaknesses.
.@srivatsamath recently suggested to me that (as a result of the Internet democratizing access to advanced math) there's been an increase in important math research done by young people. I was curious if this is true.
This seemed to me to be a good use case for Deep Research. As an (admittedly poor) proxy for the question, I asked it about the ages of authors publishing in the Annals of Mathematics, arguably one of the top math journals, from 1950 to 2025.
Some very brief first impressions from my attempts to use OpenAI's new Deep Research project to do mathematics. I'm very grateful to the person at OpenAI who gave me access.
Some short caveats: I'm just trying to evaluate the product as it currently stands, not the (obviously very rapid) pace of progress. This thread is not an attempt to forecast anything. And of course it's possible I am not using it in an ideal way.
Generally speaking I am bullish about using LLMs for mathematics--see here for an overview of my attempt to use o3-mini-high to get some value for mathematics research.
Some brief impressions from playing a bit with o3-mini-high (the new reasoning model released by OpenAI today) for mathematical uses.
First of all, it’s clearly a significant improvement over o1. It immediately solved (non-rigorously) some arithmetic geometry problems with numerical answers that I posed to it, which no other models have been able to solve. I consider these problems pretty tricky.
Next I asked it to factor the polynomial x^5-x-1 over the complex numbers. It gave me an incorrect factorization, and in fact claimed to be able to write down the roots of this polynomial in radicals.
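For reference, here's a quick sanity check of this example, sketched in Python with sympy (my tooling choice, not something used in the thread). The polynomial x^5-x-1 is a standard example of a quintic that is irreducible over Q with Galois group S_5, so its roots cannot be written in radicals; "factoring over C" amounts to listing its five roots numerically.

```python
from sympy import symbols, Poly, factor

x = symbols("x")
f = x**5 - x - 1

# Over the rationals the polynomial does not factor at all.
print(factor(f))                  # x**5 - x - 1
print(Poly(f, x).is_irreducible)  # True

# Its Galois group is S_5 (not solvable), so there is no expression of the
# roots in radicals; over C, "factoring" means listing the roots numerically.
for root in Poly(f, x).nroots():
    print(root)
```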