One piece of info that seems important to me in terms of forecasting usefulness of new AI models for mathematics: did the gold-medal-winning models, which did not solve IMO problem 6, submit incorrect answers for it? 🧵
Why this is important: it is very hard/time-intensive to check correctness of English-language mathematical proofs. (See below for a thread on this topic.)
A tool that can solve hard math problems but also produces incorrect answers to problems it can't solve could end up being a net productivity sink--or worse, if it fools the user, could lead to incorrect claims.
One worry I have is that AI tools will develop their ability to produce hard-to-check, convincing mathematical prose more rapidly than their ability to produce correct answers. If so I suspect we will soon see a deluge of persuasive-looking, incorrect results.
I've recently noticed an uptick in crank papers in my area submitted to arXiv -- 7 out of 9 papers with "Hodge conjecture" in the title or abstract since June 15 have been nonsense. I think almost all of these were written with LLM aid.
Currently it's quite easy for me to check that these papers are nonsense (a matter of a few minutes), but as capabilities improve, this will become more difficult. It's quite possible we will soon no longer be able to trust a randomly-chosen math paper on arXiv.
Of course (auto)formalization will help to address this--but in my area any sort of formalization is, I think, pretty far away.
If these models did not submit a wrong answer to P6, I think that's a pretty bullish sign for the near-term future of AI-for-math. If they did... /fin
An AI tool that gets gold on the IMO is obviously immensely impressive. Does it mean math is “solved”? Is an AI-generated proof of the Riemann hypothesis clearly on the horizon? Obviously not.
Worth keeping timescales in mind here: IMO competitors spend an average of 1.5 hrs on each problem. High-quality math research, by contrast, takes months or years.
What are the obstructions to AI performing high-quality autonomous math research? I don’t claim to know for sure, but I think they include many of the same obstructions that prevent it from doing many jobs:
In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks.
Before I start let me say a bit about how I think about this. I try to think about how useful a product is for math research in terms of its marginal value over Googling (MVoG). This is sort of a complicated metric.
A lot of undergraduate-level mathematical statements are Google-able; a model's ability to do homework problems well doesn't dramatically increase its usefulness for research. And results/proofs I find by googling are typically fairly reliable.
In this thread I want to share some thoughts about the FrontierMath benchmark, on which, according to OpenAI, some frontier models are scoring ~20%. This is a benchmark consisting of difficult math problems with numerical answers. What does it measure, and what doesn't it measure?
I'll try to organize my thoughts around this problem. To be clear, I don't intend any of this as criticism of the benchmark or of Epoch AI, which I think is doing fantastic work. Please understand anything that reads as criticism as aimed at the problem's author, namely me.
(FWIW the problem is not exactly as I wrote it--I asked for the precise value of the limit L, not its first 3 digits. The answer is the sum of the reciprocals of the first 3264 positive integers. That said I think the edit made by Epoch did not alter the question substantially.)
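For concreteness, here's a minimal sketch (in Python, my choice of tooling; nothing like this appears in the thread) of the computation behind the edited version of the problem, assuming the answer is the harmonic sum described above:

```python
from fractions import Fraction

# Sum of the reciprocals of the first 3264 positive integers (the harmonic
# number H_3264), in exact rational arithmetic so the leading digits are not
# affected by floating-point rounding.
L = sum(Fraction(1, k) for k in range(1, 3265))

# Epoch's edited version asks only for the first 3 digits of this value.
print(float(L))
```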
Had a sort of a funny experience with OpenAI’s Deep Research tool, which I wanted to share since I think it reveals some of the tool’s strengths and weaknesses.
.@srivatsamath recently suggested to me that (as a result of the Internet democratizing access to advanced math) there's been an increase in important math research done by young people. I was curious if this is true.
This seemed to me to be a good use case for Deep Research. As an (admittedly poor) proxy for the question, I asked it about the ages of authors publishing in the Annals of Mathematics, arguably one of the top math journals, from 1950 to 2025.
Some very brief first impressions from my attempts to use OpenAI's new Deep Research project to do mathematics. I'm very grateful to the person at OpenAI who gave me access.
Some short caveats: I'm just trying to evaluate the product as it currently stands, not the (obviously very rapid) pace of progress. This thread is not an attempt to forecast anything. And of course it's possible I am not using it in an ideal way.
Generally speaking I am bullish about using LLMs for mathematics--see here for an overview of my attempt to use o3-mini-high to get some value for mathematics research.
Some brief impressions from playing a bit with o3-mini-high (the new reasoning model released by OpenAI today) for mathematical uses.
First of all, it’s clearly a significant improvement over o1. It immediately solved (non-rigorously) some arithmetic geometry problems with numerical answers that I posed to it, which no other models have been able to solve. I consider these problems pretty tricky.
Next I asked it to factor the polynomial x^5-x-1 over the complex numbers. It gave me an incorrect factorization, and in fact claimed to be able to write down the roots of this polynomial in radicals.
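For reference, here's a quick sanity check of this example, sketched in Python with sympy (my tooling choice, not something used in the thread). The polynomial x^5-x-1 is a standard example of a quintic that is irreducible over Q with Galois group S_5, so its roots cannot be written in radicals; "factoring over C" amounts to listing its five roots numerically.

```python
from sympy import symbols, Poly, factor

x = symbols("x")
f = x**5 - x - 1

# Over the rationals the polynomial does not factor at all.
print(factor(f))                  # x**5 - x - 1
print(Poly(f, x).is_irreducible)  # True

# Its Galois group is S_5 (not solvable), so there is no expression of the
# roots in radicals; over C, "factoring" means listing the roots numerically.
for root in Poly(f, x).nroots():
    print(root)
```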