Rohan Paul · Jun 16, 2025
This is really BAD news for LLMs' coding skills. ☹️

The best frontier LLMs score 0% on hard real-life programming-contest problems, a domain where expert humans still excel.

The numbers come from LiveCodeBench Pro, a benchmark of problems from Codeforces, the ICPC, and the IOI (International Olympiad in Informatics), continuously updated with fresh problems to reduce the likelihood of data contamination.
📌 The Gap Targeted

Earlier reports claimed frontier LLMs now beat human grandmasters, but a cost-versus-rating plot shows otherwise.

Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the ~2,700 line that marks real human grandmasters.
🗂️ Building the Benchmark

A team of medal-winning competitors harvests each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, wiping out training leakage.
They store 584 tasks and tag each one as knowledge-, logic-, or observation-heavy, producing a balanced skill matrix.
📊 Rating Models Fairly

Every submission is treated like a chess game against the task's official difficulty rating.

A Bayesian MAP Elo fit then assigns the single rating that best matches human percentiles and strips out typing-speed bias.
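To make the fitting idea concrete, here is a minimal sketch of a MAP Elo fit, assuming the standard logistic Elo curve and a Gaussian prior; the difficulties, outcomes, and prior parameters below are made up for illustration and are not the paper's exact estimator.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: per-problem difficulty and whether the model solved it.
difficulties = np.array([1200, 1500, 1800, 2100, 2400, 2700])
solved       = np.array([1,    1,    1,    0,    0,    0])

def neg_log_posterior(r, prior_mean=1500.0, prior_sd=350.0):
    # Elo win curve: P(solve) = 1 / (1 + 10^((difficulty - rating) / 400))
    p = 1.0 / (1.0 + 10.0 ** ((difficulties - r) / 400.0))
    log_lik = np.sum(solved * np.log(p) + (1 - solved) * np.log(1 - p))
    log_prior = -0.5 * ((r - prior_mean) / prior_sd) ** 2  # Gaussian prior -> MAP
    return -(log_lik + log_prior)

map_rating = minimize_scalar(neg_log_posterior, bounds=(0, 4000), method="bounded").x
print(f"MAP Elo estimate: {map_rating:.0f}")
```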
🎯 Where Models Shine and Fail

Figure 2 shows models sail through template-friendly zones like segment trees or dynamic programming, yet plunge below 1,500 Elo on game theory, greedy tricks, and messy casework.
Zero hard-tier solves confirm the cliff.
🔍 Why Submissions Fail

A treemap comparison finds that o3-mini commits many wrong algorithms and missed insights, while humans mainly slip on implementation details.
Models also trip on sample tests they never run locally, something human coders catch instantly.
🔁 More Tries, Better Outcomes

Letting o4-mini fire off ten attempts lifts its rating by about 540 points and doubles the medium-tier pass rate, but hard problems remain untouched at 0%.
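For context, results like "ten attempts" are usually reported with the unbiased pass@k estimator from the Codex paper; whether LiveCodeBench Pro uses exactly this formula is an assumption here. Given n sampled attempts of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k attempts drawn
    without replacement from n total attempts is a correct one."""
    if n - c < k:  # fewer failures than draws, so a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 25 sampled attempts, 3 correct, evaluated at k=10.
print(f"pass@10 ≈ {pass_at_k(25, 3, 10):.3f}")
```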
🧠 Does Reasoning Help

Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge-heavy tags, yet barely moves observation-heavy tags such as greedy or ad hoc, hinting that current reasoning traces miss the "aha" moment.
💰 Terminal Power Matters

The authors estimate that around 400 Elo of the published 2,700 score comes from terminal access, which lets a model compile, unit-test, and brute-force patterns during inference.
I also publish my newsletter every single day. 🗞️

Includes:

- Top 1% AI industry developments
- Influential research papers with analysis

📚 Subscribe and get a 1,300+ page Python book instantly: rohan-paul.com


More from @rohanpaul_ai

Jan 1
🚨 BREAKING: DeepSeek dropped a core Transformer architecture improvement.

A traditional Transformer is basically a long stack of blocks. Each block does some work (self-attention or a small feed-forward network) on a "main work path", while a "shortcut path", the residual connection, carries the block's input around the block and adds it back onto the output at the end.
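In code, that standard residual pattern is just the following (a generic sketch, not DeepSeek's implementation):

```python
import torch

class ResidualBlock(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # The "main work path": a stand-in for self-attention or a feed-forward net.
        self.body = torch.nn.Sequential(
            torch.nn.LayerNorm(dim), torch.nn.Linear(dim, dim), torch.nn.GELU()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # the "shortcut path" adds the input back
```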

Hyper-Connections is a drop-in change to that shortcut path: instead of carrying 1 stream of activations through the stack, the model carries a small bundle of parallel streams and learns how to mix them before and after each block.

Standard Transformers pass information through 1 residual stream. Hyper-Connections turn that into n parallel streams, like n lanes on a highway. Small learned matrices decide how much of each lane should mix into the others at every layer.

In a normal residual connection, each layer takes the current hidden state, runs a transformation, then adds the original back, so information can flow forward without getting stuck. With Hyper-Connections, the layer does not see just 1 hidden state; it sees a small bundle of them, and it learns how to mix that bundle into the input it will process.

So wherever a traditional Transformer block computes "output = input + block(input)", Hyper-Connections computes "output bundle = a learned mix of the input bundle + the block applied to a learned mix", making the shortcut more flexible than a plain add. After the block, the mechanism again learns how to mix the transformed result back into the bundle, so different lanes can carry different kinds of information and the model can route signal through the shortcut more flexibly.

The catch is that if those learned mixing weights are unconstrained, stacking many blocks can make signals gradually blow up or fade out, and training becomes unstable in big models.

This paper proposes mHC, which keeps Hyper-Connections but forces every mixing step to behave like a safe averaging operation, so the shortcut stays stable while the transformer still gets the extra flexibility from multiple lanes.

---

The paper shows this stays stable at 27B scale and beats both a baseline and unconstrained Hyper-Connections on common benchmarks.

HC can hit about 3,000× residual amplification; mHC keeps it around 1.6×.
This image compares 3 ways to build the shortcut path that carries information around a layer in a transformer.

The left panel is the normal residual connection, where the model adds the layer output back to the original input so training stays steady as depth grows.

The middle panel is Hyper-Connections, where the model keeps several parallel shortcut streams and learns how to mix them before the layer, around the layer, and after the layer, which can help quality but can also make the shortcut accidentally amplify or shrink signals when many layers stack.

The right panel is mHC, which keeps the same Hyper-Connections idea but forces those mixing steps to stay in a constrained safe shape every time, so the shortcut behaves like a controlled blend and stays stable at large scale.
What "hyper-connection" means here:

You widen the residual from width C to n×C, treat it as n streams, and learn 3 tiny mixing pieces per layer.

One mixes the residual streams with each other; this is the crucial one. One gathers from the streams into the layer. One writes results back to the streams.

The paper's contribution is to keep the first one in the safe "doubly stochastic" set, so it mixes without amplifying.
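A minimal sketch of the bundle-and-mix idea, assuming a toy layer and using Sinkhorn-style row/column normalization to keep the stream-mixing matrix near the doubly stochastic set; the paper's exact mHC parameterization may differ.

```python
import torch

n, C = 4, 64  # n parallel residual streams, each of width C

class HyperConnectionBlock(torch.nn.Module):
    """Toy hyper-connection block with the 3 mixing pieces:
    stream<->stream mix, streams->layer read, layer->streams write."""
    def __init__(self):
        super().__init__()
        self.mix_logits = torch.nn.Parameter(torch.zeros(n, n))
        self.read = torch.nn.Parameter(torch.full((n,), 1.0 / n))
        self.write = torch.nn.Parameter(torch.full((n,), 1.0 / n))
        self.layer = torch.nn.Linear(C, C)  # stand-in for attention/FFN

    def mix_matrix(self, iters: int = 5) -> torch.Tensor:
        # Sinkhorn: alternately normalize rows and columns of a positive
        # matrix so it approaches the doubly stochastic set (mixes, never amplifies).
        M = self.mix_logits.exp()
        for _ in range(iters):
            M = M / M.sum(dim=1, keepdim=True)
            M = M / M.sum(dim=0, keepdim=True)
        return M

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, n, C)
        mixed = torch.einsum("ij,bjc->bic", self.mix_matrix(), streams)  # crucial mix
        layer_in = torch.einsum("j,bjc->bc", self.read, mixed)           # gather
        layer_out = self.layer(layer_in)                                 # the "work"
        # Write the result back into every stream, weighted per lane.
        return mixed + self.write[None, :, None] * layer_out[:, None, :]

x = torch.randn(2, n, C)
print(HyperConnectionBlock()(x).shape)  # torch.Size([2, 4, 64])
```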
Dec 25, 2025
A MASSIVE 303-page study from the very best Chinese labs.

The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.

These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.

The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.

They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.

On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.

Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.
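A generic sketch of that agent loop; `model`, `repo`, and every method on them are hypothetical stand-ins, not an API from the survey:

```python
def software_agent(issue: str, repo, model, max_iters: int = 5) -> bool:
    """Generic issue-fixing loop: plan, edit, test, retry on failure."""
    history = [f"Issue: {issue}"]
    for _ in range(max_iters):
        plan = model.generate("\n".join(history) + "\nPropose a fix as a diff.")
        repo.apply_edit(plan)                    # edit files (hypothetical API)
        ok, log = repo.run_tests()               # run the test suite
        if ok:
            return True                          # tests pass: done
        history.append(f"Tests failed:\n{log}")  # feed the failure log back
        repo.revert()                            # roll back before the next try
    return False
```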
Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025.
Evolution of programming development and research landscapes in AI-powered code generation.
Nov 29, 2025
McKinsey published a new report.

Agents, robots, and us: Skill partnerships in the age of AI

- Today's technologies could theoretically automate more than half of current US work hours, reflecting how profoundly work may change.

- By 2030, about $2.9 trillion of economic value could be unlocked in the United States.

- Demand for AI fluency, the ability to use and manage AI tools, has grown 7× in two years, faster than for any other skill in US job postings. The surge is visible across industries and likely marks the beginning of much bigger changes ahead.
Two-thirds of US work hours require only nonphysical capabilities.
Nov 17, 2025
"The Impact of Artificial Intelligence on Human Thought"

A big 132 page report.

AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment.

The paper frames this pattern through cognitive offloading and cognitive load theory, then tracks it into social effects like standardized language and biased information flows, and into manipulation tactics that target human psychology.

It says: use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls, because personalization, deepfakes, and opaque models can steer choices at scale.

🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts

Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.

In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.
🧵3/n. 🧰 Offloading and memory

Handing memory, calculation, or decision-making to an external aid frees attention now, yet steady offloading can dull recall and critical habits later.

The paper casts web search, note apps, and assistants as a human‑machine transactive memory system, useful when sources are reliable, risky when they are biased or wrong.

That is why trust and verification routines matter as much as speed.
Nov 2, 2025
16 charts that explain the AI boom

A quite insightful blog post by Kai Williams. 👏

1. The largest technology companies are investing heavily in AI
2. AI spending is significant in historical terms
3. Companies are importing a lot of AI chips