The best frontier LLMs achieve 0% on hard real-life programming-contest problems, a domain where expert humans still excel.
LiveCodeBench Pro is a benchmark of problems from Codeforces, ICPC, and the IOI (International Olympiad in Informatics), continuously updated to reduce the likelihood of data contamination.
📌 The Gap Targeted
Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot shows otherwise.
Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 legend line that marks real grandmasters.
🗂️ Building the Benchmark
A team of Olympiad medalists harvests each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, which rules out training leakage.
They store 584 tasks and tag each one as knowledge-, logic-, or observation-heavy, producing a balanced skill matrix.
📊 Rating Models Fairly
Every submission is treated like a chess game against the task’s official difficulty rating.
A Bayesian MAP Elo fit then assigns a rating directly comparable to human percentiles while stripping out typing-speed bias.
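For intuition, here is a minimal sketch of this kind of MAP Elo fit (my own illustration, not the paper’s code): each attempt is scored as a win or loss against the problem’s difficulty rating, with a Gaussian prior keeping the estimate sane when attempts are few.

```python
# Minimal MAP Elo fit sketch (illustrative only, not the paper's implementation).
# Each attempt is one "game": the model "wins" if it solves the problem.
import numpy as np
from scipy.optimize import minimize_scalar

def map_elo(difficulties, solved, prior_mean=1500.0, prior_std=350.0):
    """MAP Elo estimate under the standard logistic (Elo) win model
    and a Gaussian prior on the model's rating. Prior values are illustrative."""
    difficulties = np.asarray(difficulties, dtype=float)
    solved = np.asarray(solved, dtype=float)

    def neg_log_posterior(r):
        # Elo win probability against a problem of rating d
        p_win = 1.0 / (1.0 + 10.0 ** ((difficulties - r) / 400.0))
        log_lik = np.sum(solved * np.log(p_win) + (1 - solved) * np.log(1 - p_win))
        log_prior = -0.5 * ((r - prior_mean) / prior_std) ** 2
        return -(log_lik + log_prior)

    return minimize_scalar(neg_log_posterior, bounds=(0, 4000), method="bounded").x

# Toy usage: solves the easy problems, misses the hard ones.
print(map_elo([1200, 1500, 1800, 2200, 2600], [1, 1, 1, 0, 0]))
```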
🎯 Where Models Shine and Fail
Figure 2 shows models sail through template-heavy topics like segment trees and dynamic programming, yet plunge below 1,500 Elo on game theory, greedy tricks, and messy casework.
Zero hard-tier solves confirm the cliff.
🔍 Why Submissions Fail
A treemap comparison finds o3-mini most often picks the wrong algorithm or misses the key insight, while humans mainly slip on implementation details.
Models also fail on the provided sample tests because they never run them locally, something human coders catch instantly.
🔁 More Tries, Better Outcomes
Letting o4-mini fire ten attempts lifts its rating by about 540 points and doubles the medium-tier pass rate, but hard problems remain untouched at 0%.
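For reference, multi-attempt scoring is usually reported with the standard unbiased pass@k estimator from the Codex paper; a tiny sketch is below (the benchmark’s exact aggregation may differ).

```python
# Standard unbiased pass@k estimator (Codex paper); shown for illustration,
# LiveCodeBench Pro may aggregate attempts differently.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts passes,
    given n total attempts of which c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 2 passing attempts out of 10 samples.
print(pass_at_k(n=10, c=2, k=1))   # ~0.20
print(pass_at_k(n=10, c=2, k=10))  # 1.0, since all ten attempts are counted
```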
🧠 Does Reasoning Help
Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge-heavy tags, yet barely moves observation-heavy tags such as greedy or ad hoc, hinting that current reasoning traces miss the aha moment.
💰 Terminal Power Matters
The authors estimate that around 400 Elo of the published 2,700 score comes from terminal access, which lets a model compile, unit-test, and brute-force patterns during inference.
🚨 BREAKING: DeepSeek dropped a core Transformer architecture improvement.
A traditional transformer is basically a long stack of blocks, and each block has a “main work path” plus a “shortcut path” called the residual connection that carries the input around the block and adds it back at the end.
Each block in this original architecture does some work (self-attention or a small feed-forward network), then adds the block’s input back onto its output, which is why people describe it as a “main path” plus a “shortcut path.”
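To make that concrete, here is a minimal PyTorch sketch of the plain residual block, with a small feed-forward network standing in for the block’s work (illustrative only):

```python
# Minimal pre-norm residual block: "main work path" plus "shortcut path".
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = input + block(input): the add is the residual shortcut
        return x + self.ffn(self.norm(x))
```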
Hyper-Connections is a drop-in change to that shortcut path, because instead of carrying 1 stream of activations through the stack, the model carries a small bundle of parallel streams, then it learns how to mix them before a block and after a block.
Standard Transformers pass information through 1 residual stream. Hyper-Connections turn that into n parallel streams, like n lanes on a highway. Small learned matrices decide how much of each lane should mix into the others at every layer.
In a normal residual connection, each layer takes the current hidden state, runs a transformation, then adds the original back, so information can flow forward without getting stuck.
In this new Hyper-Connections, the layer does not see just 1 hidden state, it sees a small bundle of them, and before the layer it learns how to mix that bundle into the input it will process.
So in a traditional transformer block, wherever you normally do “output equals input plus block(input),” Hyper-Connections turns that into “output bundle equals a learned mix of the input bundle plus the block applied to a learned mix,” so the shortcut becomes more flexible than a plain add.
After the block runs, the Hyper-Connections mechanism again learns how to mix the transformed result back into the bundle, so different lanes can carry different kinds of information and the model can route signal through the shortcut in a more flexible way.
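Here is a minimal PyTorch sketch of a Hyper-Connections-style block, assuming n residual streams, a learned n-by-n stream mixer H, a gather vector a that forms the layer input, and a write-back vector b. The names and exact parameterization are my own illustration, not the paper’s code.

```python
# Hyper-Connections-style block sketch: a bundle of n residual streams,
# with learned mixing before the layer, across the streams, and after the layer.
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    def __init__(self, dim: int, hidden: int, n_streams: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        # the three learned mixing pieces: stream-to-stream, gather-into-layer, write-back
        self.H = nn.Parameter(torch.eye(n_streams))                 # n x n stream mixer
        self.a = nn.Parameter(torch.ones(n_streams) / n_streams)    # gather weights
        self.b = nn.Parameter(torch.ones(n_streams) / n_streams)    # write-back weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_streams, seq, dim), the bundle of parallel residual lanes
        layer_in = torch.einsum("s,bstd->btd", self.a, x)       # mix the bundle into one layer input
        layer_out = self.ffn(self.norm(layer_in))               # the block's "main path" work
        mixed = torch.einsum("ij,bjtd->bitd", self.H, x)        # shortcut: learned lane-to-lane mixing
        return mixed + self.b.view(1, -1, 1, 1) * layer_out.unsqueeze(1)  # write result back to lanes
```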
The catch is that if those learned mixing weights are unconstrained, stacking many blocks can make signals gradually blow up or fade out, and training becomes unstable in big models.
This paper proposes mHC, which keeps Hyper-Connections but forces every mixing step to behave like a safe averaging operation, so the shortcut stays stable while the transformer still gets the extra flexibility from multiple lanes.
---
The paper shows this stays stable at 27B scale and beats both a baseline and unconstrained Hyper-Connections on common benchmarks.
Unconstrained HC can hit about 3,000x residual amplification; mHC keeps it around 1.6x.
This image compares 3 ways to build the shortcut path that carries information around a layer in a transformer.
The left panel is the normal residual connection, where the model adds the layer output back to the original input so training stays steady as depth grows.
The middle panel is Hyper-Connections, where the model keeps several parallel shortcut streams and learns how to mix them before the layer, around the layer, and after the layer, which can help quality but can also make the shortcut accidentally amplify or shrink signals when many layers stack.
The right panel is mHC, which keeps the same Hyper-Connections idea but forces those mixing steps to stay in a constrained safe shape every time, so the shortcut behaves like a controlled blend and stays stable at large scale.
What “hyper-connection” means here.
You widen the residual from size C to n×C, treat it as n streams, and learn 3 tiny mixing pieces per layer.
One mixes the residual streams with each other (this is the crucial one). One gathers from the streams into the layer. One writes results back to the streams.
The paper’s contribution is to keep the first one in the safe “doubly stochastic” set, so it mixes without amplifying.
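One common way to keep a mixing matrix in that doubly stochastic set is Sinkhorn normalization, sketched below; this illustrates the constraint itself, not necessarily the paper’s exact construction.

```python
# Sketch: project an unconstrained n x n matrix toward the doubly stochastic set
# via Sinkhorn normalization. One standard construction; mHC's exact recipe may differ.
import torch

def doubly_stochastic(logits: torch.Tensor, iters: int = 10) -> torch.Tensor:
    """Return a non-negative matrix whose rows and columns each sum to ~1,
    so mixing the residual streams is an averaging step that neither
    amplifies nor shrinks the total signal."""
    M = logits.exp()  # make all entries positive
    for _ in range(iters):
        M = M / M.sum(dim=1, keepdim=True)  # normalize rows
        M = M / M.sum(dim=0, keepdim=True)  # normalize columns
    return M

H = doubly_stochastic(torch.randn(4, 4))
print(H.sum(dim=0), H.sum(dim=1))  # both close to all-ones
```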
A MASSIVE 303-page study from the very best Chinese labs.
The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.
These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.
The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.
They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.
On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.
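A toy sketch of that loop (my own illustration, not any particular framework; `llm` and `apply_patch` are hypothetical placeholders):

```python
# Toy software-engineering agent loop: read an issue, plan, patch, test, retry.
import subprocess

def run_tests() -> bool:
    """Run the repo's test suite; assumes a pytest-style project."""
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0

def agent_loop(issue: str, llm, apply_patch, max_rounds: int = 5) -> bool:
    """`llm` is a placeholder for a model call, `apply_patch` for a file editor."""
    history = [f"Issue: {issue}"]
    for _ in range(max_rounds):
        plan = llm("Plan the next code change.\n" + "\n".join(history))
        patch = llm("Write a patch for this plan.\n" + plan)
        apply_patch(patch)          # edit files in the working tree
        if run_tests():             # stop as soon as the tests pass
            return True
        history.append("Tests failed after patch:\n" + patch)
    return False                    # give up after max_rounds retries
```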
Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.
Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025.
Evolution of programming development and research landscapes in AI-powered code generation.
Agents, robots, and us: Skill partnerships in the age of AI
- Today’s technologies could theoretically automate more than half of current US work hours. This reflects how profoundly work may change.
- By 2030, about $2.9 trillion of economic value could be unlocked in the United States.
- Demand for AI fluency—the ability to use and manage AI tools—has grown 7X in two years, faster than for any other skill in US job postings. The surge is visible across industries and likely marks the beginning of much bigger changes ahead.
Two-thirds of US work hours require only nonphysical capabilities.
"The Impact of Artificial Intelligence on Human Thought"
A big 132-page report.
AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment.
The paper frames this pattern through cognitive offloading and cognitive load theory, then tracks it into social effects like standardized language, biased information flows, and manipulation tactics that target human psychology.
It says use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls because personalization, deepfakes, and opaque models can steer choices at scale.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.
In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.
🧵3/n. 🧰 Offloading and memory
Handing memory, calculation, or choosing to an external aid frees attention now, yet steady offloading can dull recall and critical habits later.
The paper casts web search, note apps, and assistants as a human‑machine transactive memory system, useful when sources are reliable, risky when they are biased or wrong.
That is why trust and verification routines matter as much as speed.