Latest Twitter Threads by @shai_s_shwartz on Thread Reader App

Aug 14 • 9 tweets • 5 min read

Are frontier AI models really capable of “PhD-level” reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. The benchmark consists of three tiers of increasing complexity, which we call ‘shallow’, ‘deeper’, and ‘deepest’.

The results are remarkable:
- On the ‘shallow’ tier, top models reach performance of 50%-70%, indicating that the models are familiar with the subject matter.
- On ‘deeper’, Grok 4, Gemini-Pro, o3-Pro, Opus-4 all solve at most 1/100 problems. GPT-5 Pro is significantly better, but still solves only 4/100 problems.
- On ‘deepest’, all models collapse to 0% success rate.
🧵

2/
Why do models suffer conceptual collapse on ‘deepest’, even while achieving top-human performance in algorithmic coding competitions? The problems in the ‘deepest’ tier demand very deep reasoning, something that existing models simply can’t do.

FormulaOne may require a qualitatively different approach. We’re sharing it with the community through a Live Leaderboard and Evaluation Framework.

Nov 5, 2022 • 8 tweets • 2 min read

Classification vs. Regression is not the issue. The real question is whether you model the uncertainty. And, btw, this is not merely an academic question: it has practical implications. I'll illustrate using the "truck-and-trailer problem". 1/n

https://twitter.com/ducha_aiki/status/1587366668845588480

It is sometimes difficult to know if we see a truck and a trailer, or we see a long truck (that is, whether we see one object or two). Let's see how to model the problem of predicting the position of the rear-side of a vehicle. 2/n
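
To make the point about uncertainty concrete, here is a minimal sketch with made-up numbers (the distances, probabilities, and function names are assumptions for illustration, not taken from the thread). Suppose the rear of the vehicle is either 8 m away (a long truck is not present, just a truck) or 16 m away (truck plus trailer), and both readings are equally plausible. A single-number prediction trained with squared loss settles on the average, which matches neither hypothesis, while an explicit distribution keeps both alive.

```python
# Hypothetical scenario: rear position in meters -> probability.
# 8 m = truck only, 16 m = truck plus trailer. Numbers are invented.
hypotheses = {8.0: 0.5, 16.0: 0.5}


def squared_loss_minimizer(dist):
    """The single value that minimizes expected squared error is the mean."""
    return sum(x * p for x, p in dist.items())


def expected_squared_error(guess, dist):
    """Expected squared error of one point guess under the distribution."""
    return sum(p * (x - guess) ** 2 for x, p in dist.items())


def prob_rear_beyond(d, dist):
    """Probability that the vehicle's rear lies farther than d meters."""
    return sum(p for x, p in dist.items() if x > d)


point = squared_loss_minimizer(hypotheses)
print("point estimate:", point)
print("P(rear beyond point estimate):", prob_rear_beyond(point, hypotheses))
```

The point estimate lands between the two hypotheses, at a position where no vehicle rear is actually expected, yet half of the probability mass lies beyond it. Whether the output head is called "classification" or "regression" matters less than whether this two-mode structure survives in what the model reports.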

Aug 15, 2022 • 8 tweets • 2 min read

Congrats to #FSD team for the great progress!
Our Supervision system shares the camera-centric approach, but differs in some key elements.
1) Crowd-based HD map vs. SD map
2) Math-based vs. simulation-based driving policy
3) Multiple redundant systems vs. a single one
0/n

1/n
Humans drive better when they are familiar with the road ahead. Furthermore, it is better to solve problems offline than to solve them online. Offline has more compute, knowledge of the future, optimal weather conditions, and the ability to validate quality.
