Claim: gpt-5-pro can prove new interesting mathematics.
Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than the one in the paper, and I checked the proof: it's correct.
Details below.
The paper in question is this one, which studies the following very natural question: in smooth convex optimization, under what conditions on the stepsize eta in gradient descent will the curve traced by the function values of the iterates be convex? arxiv.org/pdf/2503.10138…
In the v1 of the paper they prove that if eta is smaller than 1/L (where L is the smoothness constant) then one gets this property, and if eta is larger than 1.75/L then they construct a counterexample. So the open problem was: what happens in the range [1/L, 1.75/L]?
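To make the property concrete, here is a minimal numerical sketch of what "the value curve is convex" means; the quadratic test function and all names below are my own illustration, not from the paper (note that for quadratics the property holds for any stable stepsize, so this only illustrates the definition, not the counterexample):

```python
import numpy as np

# Smooth convex test function f(x) = 0.5 * x^T A x; smoothness L = largest eigenvalue.
A = np.diag([1.0, 10.0])
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def value_curve_is_convex(eta, steps=50):
    """Run gradient descent and check that f(x_0), f(x_1), ... is a convex
    sequence, i.e. its successive differences are non-decreasing."""
    x = np.array([1.0, 1.0])
    vals = [f(x)]
    for _ in range(steps):
        x = x - eta * grad(x)
        vals.append(f(x))
    diffs = np.diff(vals)
    return bool(np.all(np.diff(diffs) >= -1e-12))

print(value_curve_is_convex(0.9 / L))  # True: eta < 1/L is covered by the v1 result
```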
As you can see in the top post, gpt-5-pro was able to improve the bound from this paper and showed that in fact eta can be taken to be as large as 1.5/L, so not quite fully closing the gap but making good progress. Def. a novel contribution that'd be worthy of a nice arxiv note.
Now the only reason I won't post this as an arxiv note is that the humans actually beat gpt-5 to the punch :-). Namely, the arxiv paper has a v2 with an additional author, and they closed the gap completely, showing that 1.75/L is the tight bound. arxiv.org/pdf/2503.10138…
By the way this is the proof it came up with:
And yeah, the fact that it proves 1.5/L and not the 1.75/L also shows it didn't just search for the v2. Also, the above proof is very different from the v2 proof; it's more of an evolution of the v1 proof.
It's getting harder and harder to get signal from benchmark numbers. Rather than averages, I expect in the (near) future we will also care about the "argmax": what's the BEST output a model can deliver? After all, we don't need to solve P vs NP 10 out of 10 times, once is enough 😅. So with that in mind, let me tell you a bit more about THE MOST IMPRESSIVE LLM OUTPUT I have ever seen.
Many of you are probably familiar with the preferential attachment model (also known as Barabasi-Albert), a growing random graph process where each new node attaches to existing nodes with probability proportional to their degrees. This is very similar to how a network like X grows (popular accounts attract more and more followers).
(beautiful gif by Igor Kortchemski)
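The process is simple enough to simulate in a few lines; this is a minimal sketch (function name and parameters are my own), using the classic trick that a uniform draw from the list of edge endpoints is a degree-proportional draw over nodes:

```python
import random

def preferential_attachment(n, seed=0):
    """Grow a Barabasi-Albert-style graph one node at a time: each new node
    attaches to a single existing node chosen with probability
    proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]       # start from a single edge
    # One entry per edge endpoint, so a uniform choice from this list
    # picks a node with probability proportional to its degree.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

edges = preferential_attachment(10_000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(max(degree.values()))  # early nodes end up with very high degree
```

Running this and plotting the degree distribution shows the heavy tail that makes the model a popular stand-in for real social networks.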
In 2012 Ramon and Comendant introduced, in their COLT open problem, a variant of the preferential attachment process where each node is born with an "attractiveness" parameter, either 1 or W>1, and now the probability of connecting to a previous node is proportional to the total attractiveness of its neighbors! I.e., if you have a lot of attractive friends, you will attract more connections. I guess it simulates how an AI research lab grows or something 🤣. The very simple question they asked is: can one estimate W from observing the graph? [They asked something more quantitative than this actually, but that's the essence of their question.]
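A minimal simulation sketch of the variant as described above; the function name, the choice of W, and the 50/50 attractiveness assignment are my own assumptions for illustration (the original open problem is stated more quantitatively):

```python
import random

def attractiveness_pa(n, W=3.0, p=0.5, seed=0):
    """Variant PA process: each node is born with attractiveness 1 or W
    (W with probability p), and each new node connects to an existing node
    with probability proportional to the total attractiveness of that
    node's current neighbors."""
    rng = random.Random(seed)
    attract = [W if rng.random() < p else 1.0 for _ in range(n)]
    neighbors = {0: [1], 1: [0]}
    for new in range(2, n):
        weights = [sum(attract[j] for j in neighbors[i]) for i in range(new)]
        # Weighted draw over existing nodes (O(n) per step; fine for a sketch).
        r = rng.random() * sum(weights)
        acc, target = 0.0, 0
        for i, w in enumerate(weights):
            acc += w
            if r < acc:
                target = i
                break
        neighbors[new] = [target]
        neighbors[target].append(new)
    return neighbors, attract

neighbors, attract = attractiveness_pa(2_000)
```

The estimation question is then: given only `neighbors` (and not `attract`), how well can you recover W?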
o3-mini is a remarkable model. Somehow it has *grokked arxiv* in a way that no other model on the planet has, turning it into a valuable research partner!
Below is a deceptively simple question that confuses *all* other models, but where o3-mini gives an extremely useful answer!
Indeed it hits all the right things: the connection with self-contracted curves, the dimension-dependent bound, and even cites a relevant paper!
This is not cherry-picked at all; it was literally the first query I made. Here is the second query, on a completely different topic:
Interestingly though the reference it gives “Bubeck and Ganguly” is not quite the correct one, but it is very closely related! In general I have found that references are "fuzzily correct", giving some mixed up version of authors/journals/titles, but surprisingly still useful!
The @TEDTalks by @YejinChoinka is both insightful & beautifully delivered! Totally agree with her that GPT-4 is simultaneously brilliant and incredibly stupid. Yejin gives 3 examples of common-sense failures that are worth examining a bit more closely. 1/5
The first example is shocking: GPT-4 does not seem to understand that one can dry clothes in parallel?!? However, w. a more open-ended formulation ("what would happen if") one gets a great answer! This example shows that GPT-4 remains quite brittle with short questions like this. 2/5
The second example is particularly interesting to me because it doesn't match my experience with GPT-4 at all. What's going on? Turns out there is a simple explanation: this is pure pattern matching in action! GPT-4 memorized it from online sources like this one: mathsisfun.com/puzzles/measur…
Posts can be found here: iclr-blog-track.github.io/blog/. They fulfill the original promise: they add/replicate experiments, they illuminate prior work w. a different theoretical framework, they add new insights, ... We strongly believe there is room for such a track in all ML conferences. 2/24
Our first post is an invited post by @karpathy on @ylecun's revolutionary 1989 paper on convolutional neural networks. This post illustrates perfectly what a blogpost track can do for a conference, which includes highlighting influential work! 3/24 iclr-blog-track.github.io/2022/03/26/lec…
New video! Probably best described as "a motivational speech to study deep learning mathematically" :-).
The ever so slightly more formal title is "Mathematical theory of deep learning: Can we do it? Should we do it?"
1/3
Context for this talk was an NSF Town Hall with the goal of discussing the successes of deep learning, especially in light of more traditional fields. Other talks by @tomgoldsteincs @joanbruna @ukmlv, Yuejie Chi, Guy Bresler, and Rina Foygel Barber at this link: players.brightcove.net/679256133001/N…
2/3
But honestly I think my talk might as well have been just the title slide followed by the wise words in this video:
We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806 1/7
With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of our conjectured law of robustness from last summer: there is an inherent tradeoff between # neurons and smoothness of the network (see *pre-solution* video). 2/7
If you squint hard enough (eg, like a physicist) our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to possibly attain good robustness guarantees. 3/7
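For the curious, the back-of-the-envelope behind a figure like 100 billion, assuming the law takes the form p ≳ n·d (number of samples times effective dimension); the specific values of n and d below are my own rough stand-ins, not numbers from the paper:

```python
# Law-of-robustness back-of-the-envelope: smooth interpolation of n points
# in effective dimension d needs roughly p >~ n * d parameters.
n_samples = 10**7   # assumed: roughly ImageNet-scale dataset size
d_eff = 10**4       # assumed: rough effective dimension of images
p_needed = n_samples * d_eff
print(p_needed)  # 100000000000, i.e. ~100 billion, GPT-3-like scale
```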