This is really BAD news for LLMs’ coding skills. ☹️
The best frontier LLMs achieve 0% on hard real-world programming-contest problems, a domain where expert humans still excel.
LiveCodeBench Pro is a benchmark composed of problems from Codeforces, the ICPC, and the IOI (International Olympiad in Informatics), continuously updated to reduce the likelihood of data contamination.
📌 The Gap Targeted
Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot shows otherwise.
Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 line that marks real grandmasters.
🗂️ Building the Benchmark
A team of Olympiad medalists harvests each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, eliminating training-data leakage.
They store 584 tasks and tag each one as knowledge-, logic-, or observation-heavy, producing a balanced skill matrix.
📊 Rating Models Fairly
Every submission is treated like a chess game against the task’s official difficulty.
A Bayesian MAP Elo fit then assigns each model the rating that best matches human percentiles while stripping out typing-speed bias.
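To make the rating step concrete, here is a minimal sketch of a MAP Elo fit, assuming the standard Elo logistic win curve and an illustrative Gaussian prior (the paper’s exact likelihood and prior parameters may differ):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def solve_prob(rating, difficulty):
    """Elo logistic model: chance a player of `rating` solves a problem of `difficulty`."""
    return 1.0 / (1.0 + 10 ** ((difficulty - rating) / 400.0))

def map_elo(difficulties, outcomes, prior_mean=1500.0, prior_sd=350.0):
    """MAP estimate of a model's Elo from per-problem solve/fail outcomes.

    difficulties: official problem ratings
    outcomes: 1 (solved) / 0 (failed) per problem
    The Gaussian prior keeps the estimate finite when a model solves
    everything or nothing. Prior parameters here are illustrative guesses,
    not the paper's values.
    """
    d = np.asarray(difficulties, dtype=float)
    y = np.asarray(outcomes, dtype=float)

    def neg_log_posterior(r):
        p = solve_prob(r, d)
        log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        log_prior = -0.5 * ((r - prior_mean) / prior_sd) ** 2
        return -(log_lik + log_prior)

    return minimize_scalar(neg_log_posterior, bounds=(0, 4000), method="bounded").x

# Example: a model that solves easy problems but fails hard ones
print(map_elo([1200, 1500, 1800, 2100, 2400], [1, 1, 1, 0, 0]))
```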
🎯 Where Models Shine and Fail
Figure 2 shows models sail through template zones like segment trees or dynamic programming, yet plunge below 1,500 Elo on game theory, greedy tricks, and messy casework.
Zero hard-tier solves confirm the cliff.
🔍 Why Submissions Fail
A treemap comparison finds o3-mini commits many wrong-algorithm and missed-insight errors, while humans mainly slip on implementation details.
Models also trip on sample inputs they never run locally, failures human coders would catch instantly.
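Catching those failures is mechanical: pipe each published sample input through the solution and diff the output before submitting. A minimal harness sketch (the `sol.py` command and the sample pair are hypothetical):

```python
import subprocess

def check_samples(cmd, samples):
    """Run a candidate solution against the problem's sample tests.

    cmd: command that reads stdin and writes stdout, e.g. ["python3", "sol.py"]
    samples: list of (input_text, expected_output) pairs
    Returns the indices of failing samples.
    """
    failures = []
    for i, (inp, expected) in enumerate(samples):
        out = subprocess.run(cmd, input=inp, capture_output=True,
                             text=True, timeout=5).stdout
        if out.split() != expected.split():  # whitespace-insensitive compare
            failures.append(i)
    return failures

# Example with a hypothetical solution file and one sample test
print(check_samples(["python3", "sol.py"], [("2\n1 2\n", "3\n")]))
```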
🔁 More Tries, Better Outcomes
Letting o4-mini fire ten attempts lifts its rating by about 540 points and doubles the medium-tier pass rate, but hard problems remain untouched at 0%.
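That jump is what the standard pass@k estimator predicts whenever the per-attempt solve rate is nonzero, and why it stays pinned at 0% when no attempt ever succeeds. A sketch using the usual unbiased formula (Chen et al., 2021), which may differ from how the paper aggregates attempts into Elo:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n total attempts (of which c were correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 3 correct out of 20 attempts, ten tries pass ~89% of the time
print(pass_at_k(n=20, c=3, k=10))
# With 0 correct attempts, more tries never help
print(pass_at_k(n=20, c=0, k=10))
```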
🧠 Does Reasoning Help?
Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge-heavy tags, yet barely moves observation-heavy tags such as greedy or ad hoc, hinting that current reasoning traces miss the aha moment.
💰 Terminal Power Matters
The authors estimate around 400 Elo of the published 2,700 score comes from terminal access that lets a model compile, unit-test, and brute-force patterns during inference.
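One brute-force pattern terminal access enables is classic stress testing: pit the fast candidate solution against a slow but obviously correct reference on random small inputs until they disagree. A sketch, with hypothetical `./fast` and `./brute` binaries that both read stdin and print an answer:

```python
import random
import subprocess

def stress_test(fast_cmd, brute_cmd, gen_input, rounds=1000):
    """Compare a candidate solution against a brute-force reference on
    random small inputs; return the first counterexample input, if any."""
    for _ in range(rounds):
        inp = gen_input()
        fast = subprocess.run(fast_cmd, input=inp, capture_output=True, text=True).stdout
        brute = subprocess.run(brute_cmd, input=inp, capture_output=True, text=True).stdout
        if fast.split() != brute.split():
            return inp  # counterexample found
    return None

def gen_input():
    """Random tiny test case: a length n and n small integers."""
    n = random.randint(1, 6)
    xs = [str(random.randint(1, 10)) for _ in range(n)]
    return f"{n}\n{' '.join(xs)}\n"

bad = stress_test(["./fast"], ["./brute"], gen_input)
print("counterexample:", bad) if bad else print("no mismatch in 1000 rounds")
```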
I also publish my newsletter every single day.
→ 🗞️
Includes:
- Top 1% AI industry developments
- Influential research papers with analysis
📚 Subscribe and get a 1,300+ page Python book instantly: rohan-paul.com