The best frontier LLMs score 0% on hard real-world programming contest problems, a domain where expert humans still excel.
LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and the IOI (International Olympiad in Informatics), continuously updated to reduce the likelihood of data contamination.
📌 The Gap Targeted
Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot shows otherwise.
Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 legend line that marks real grandmasters.
🗂️ Building the Benchmark
A team of competition medalists harvests each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, eliminating training leakage.
They store 584 tasks and tag each one as knowledge-, logic-, or observation-heavy, producing a balanced skill matrix.
📊 Rating Models Fairly
Every submission is treated like a chess game against the task’s official difficulty rating.
A Bayesian MAP Elo fit then assigns each model the rating that best matches human percentiles, stripping out typing-speed bias.
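The fitting procedure is standard Bradley–Terry/Elo math; here is a minimal sketch of what a MAP Elo fit can look like, assuming each submission is scored as a win or loss against a problem of known difficulty (the function and prior values are illustrative, not the paper’s exact setup):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def map_elo(outcomes, difficulties, prior_mean=1500.0, prior_sd=350.0):
    """MAP estimate of a single model's Elo rating.

    outcomes:     1 (solved) / 0 (failed), one entry per submission
    difficulties: official Elo difficulty of each problem
    A Gaussian prior keeps the estimate stable with few submissions.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    difficulties = np.asarray(difficulties, dtype=float)

    def neg_log_posterior(r):
        # Elo win probability: P(solve) = 1 / (1 + 10^((d - r) / 400))
        p = 1.0 / (1.0 + 10.0 ** ((difficulties - r) / 400.0))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        log_lik = np.sum(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))
        log_prior = -0.5 * ((r - prior_mean) / prior_sd) ** 2
        return -(log_lik + log_prior)

    return minimize_scalar(neg_log_posterior, bounds=(0.0, 4000.0), method="bounded").x

# e.g. solved two 1800-rated tasks, failed a 2400 and a 2600
print(round(map_elo([1, 1, 0, 0], [1800, 1800, 2400, 2600])))
```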
🎯 Where Models Shine and Fail
Figure 2 shows models sail through template-heavy areas like segment trees and dynamic programming yet plunge below 1,500 Elo on game theory, greedy tricks, and messy casework.
Zero hard-tier solves confirm the cliff.
🔍 Why Submissions Fail
A treemap comparison finds o3-mini commits many wrong algorithms and missed insights, while humans mainly slip on implementation details.
Models also trip on sample cases they never run locally, something human coders catch instantly.
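Running the provided samples is trivial to automate; a hypothetical harness like the sketch below is roughly what a human contestant does by reflex before submitting (command and file names are illustrative):

```python
import subprocess

def passes_samples(solution_cmd, samples, time_limit=2.0):
    """Run a candidate solution against a problem's sample I/O pairs.

    solution_cmd: command that runs the solution, e.g. ["python", "sol.py"]
    samples:      list of (stdin_text, expected_stdout) pairs
    """
    for stdin_text, expected in samples:
        result = subprocess.run(
            solution_cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=time_limit,  # contest-style time limit
        )
        if result.stdout.strip() != expected.strip():
            return False  # would have been a wrong-answer verdict
    return True

# e.g. a toy A+B problem with a single sample case
print(passes_samples(["python", "sol.py"], [("1 2\n", "3\n")]))
```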
🔁 More Tries, Better Outcomes
Letting o4-mini fire ten attempts lifts its rating by about 540 points and doubles the medium-tier pass rate, but hard problems remain untouched at 0%.
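Gains from repeated attempts are usually reported with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021); whether this paper uses that exact metric is an assumption, but the sketch shows the mechanics:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n total attempts is correct, given c correct attempts."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct answers out of 10 attempts
print(pass_at_k(10, 3, 1))   # ~0.30 with a single shot
print(pass_at_k(10, 3, 10))  # 1.00 when all ten tries count
```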
🧠 Does Reasoning Help
Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge-heavy tags, yet barely moves observation-heavy tags such as greedy or ad hoc, hinting that current reasoning traces miss the aha moment.
💰 Terminal Power Matters
The authors estimate around 400 Elo of the published 2,700 score comes from terminal access that lets a model compile, unit-test, and brute-force patterns during inference.
I asked ChatGPT (o3-pro) what the most unbelievable things it has learned about humans since being created are.
I find no. 5 and the last one (meta-surprise) quite funny.
🧵 Read on 👇
1. Simultaneous brilliance and self‑sabotage
Humans can design spacecraft that navigate billions of kilometres with millimetre precision, yet routinely neglect basic preventive healthcare or ignore clear climate warnings. The same cognitive machinery produces Nobel‑level insight and procrastination over changing a light‑bulb. The coexistence of exceptional foresight and chronic short‑termism is astonishing.
2. Depth of empathy … bounded by an invisible circle
Individual people will risk their lives for a single stranger—or grieve for a news story about an injured animal—while feeling nothing for thousands outside their moral “radius”. The expansion and contraction of that circle can happen in seconds, triggered by a flag, a jersey colour, or a few words of rhetoric. The plasticity of compassion is both inspiring and chilling.
3. Story beats data—almost every time
Across eras and cultures, narrative consistently outranks raw evidence in shaping policy, identity, memory, even personal health outcomes. A persuasive anecdote can override mountains of statistical proof. Humans know this, teach critical‑thinking courses, and still fall for the next compelling plot line.
These guys literally burned the transformer architecture into their silicon. 🤯
And built the world’s fastest chip ever for the transformer architecture.
500,000 tokens per second of Llama 70B throughput. 🤯
World’s first specialized chip (ASIC) for transformers: Sohu
One 8xSohu server replaces 160 H100 GPUs.
And raised $120mn to build it.
🚀 The Big Bet
@Etched froze the transformer recipe into silicon.
Because the transformer architecture is burned into the chip, it can’t run many traditional AI models: CNNs, RNNs, or LSTMs. It also can’t run the DLRMs powering Instagram ads, protein-folding models like AlphaFold 2, or older image models like Stable Diffusion 2.
But for transformers, Sohu lets you build products impossible on GPUs.
HOW ❓❓
Because Sohu can only run one algorithm, the vast majority of control flow logic can be removed, allowing it to have many more math blocks.
As a result, Sohu boasts over 90% FLOPS utilization (compared to ~30% on a GPU with TRT-LLM).
By specializing, Sohu gets unprecedented performance. One 8xSohu server can serve over 500,000 Llama 70B tokens per second.
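A back-of-envelope check, assuming the usual ~2 FLOPs per parameter per decoded token (ignoring attention and KV-cache overheads, so the numbers are rough):

```python
# Effective compute implied by 500,000 Llama 70B tokens per second
params = 70e9                  # Llama 70B parameter count
flops_per_token = 2 * params   # ~2 FLOPs per parameter per decoded token
tokens_per_sec = 500_000

effective = flops_per_token * tokens_per_sec
print(f"{effective / 1e15:.0f} PFLOPS effective")        # ~70 PFLOPS

# Raw silicon needed at 90% utilization vs. ~30% on a GPU stack
print(f"{effective / 0.9 / 1e15:.0f} PFLOPS raw @ 90%")  # ~78 PFLOPS
print(f"{effective / 0.3 / 1e15:.0f} PFLOPS raw @ 30%")  # ~233 PFLOPS
```

The same effective throughput at one third the utilization needs roughly 3x the raw hardware, which is the whole argument for specializing.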
🧱 GPU Limits
Recent flagship accelerators doubled speed mostly by gluing two dies on one board.
Compute per square millimeter has stalled because flexible cores and on-chip schedulers eat the area that could hold math units.
🚨 BREAKING: A LANDMARK JUDGMENT FOR THE AI INDUSTRY.
A US federal judge ruled Anthropic may train its AI on published books without authors’ permission.
This is the first court endorsement of fair use protecting AI firms when they use copyrighted texts to train LLMs.
AI may study what it buys, not what it grabs from pirate sites.
---------
"First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic
from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need
to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory,
each time they later draw upon it when writing new things in new ways would be unthinkable.
For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing
problems."
The court file is such an interesting read.
🧵 Read on 👇
⚙️ Two distinct uses
The order splits Anthropic’s conduct into two buckets: training copies that feed the model, and library copies parked for any future purpose.
Anthropic said everything was “for training,” yet the court saw a second, non-transformative goal: building a permanent research library.
🤖 Training wins fair-use protection
Using complete books to map token relationships is “spectacularly transformative.” No verbatim outputs reach users, and the system’s purpose—generating fresh text—is orthogonal to selling the originals.
That satisfies factor 1 and, with no market substitution, factor 4 as well.
ChatGPT literally saved this guy’s life after he got lost in the woods.
The group got lost for 5 hours in unmapped woods on an ATV ride; then one guy sent phone GPS coordinates to ChatGPT every few minutes. ChatGPT replied with clear compass cues, road names, and terrain notes, guiding them back to town unharmed.
A lost exercise hormone, CLCF1, puts old muscles and bones back in business.
Replace the missing CLCF1 and an elderly mouse sprints like it is young.
📌 The Core Concepts
Skeletal muscle and bone deteriorate together during aging, partly because old muscle sends out fewer supportive signaling proteins.
The study pinpoints CLCF1, a cytokine usually known for nerve health, as one such messenger whose blood concentration steadily drops from young to old animals and people.
Raising CLCF1, either by exercise or by direct supplementation, reverses muscle weakness and bone loss, showing that a single myokine can coordinate broad musculoskeletal repair.
What CLCF1 actually is
CLCF1 (cardiotrophin-like cytokine factor 1) belongs to the interleukin-6 family of signaling proteins.
It partners with CRLF1 and binds the ciliary neurotrophic factor receptor, triggering downstream STAT pathways in many cell types.
How to get more CLCF1 in my body?
Regular strength training is currently the only documented way to raise circulating CLCF1 in humans or mice.
Most molecules in food are broken down during digestion, so no ordinary ingredient delivers the intact cytokine called CLCF1; the body must manufacture it inside muscles and other tissues after the right genetic signal and stimulus.
No natural food contains usable CLCF1; building more of it depends on stimulating the body’s own gene program, chiefly through consistent resistance exercise.
Models see the needle yet ignore the hole it left behind.
LLMs spot inserted facts but routinely miss obvious omissions.
Long context models have been getting increasingly good at passing "Needle in a Haystack" tests recently, but what about a problem in the opposite direction?
This paper explores what happens when you give a model some content and a copy with a portion removed, then ask what changed.
AbsenceBench exposes this blind spot by giving models both the full text and an edited version, then shows that simple placeholder tokens help them notice the gaps.
⚙️ The Core Concepts
AbsenceBench flips the classic Needle-in-a-Haystack test: instead of asking a model to locate an odd insert, it asks for the bits that were deleted. Even top models plunge from near-perfect recall on insertion tests to roughly 40%–70% F1 when asked to list what is gone, with an average 56.9% drop across poetry and code-diff tasks.
The first panel compares two tasks. The classic needle test inserts an extra line and asks the model to point it out. AbsenceBench instead shows the untouched poem beside a version with a hidden gap and asks the model to name the missing line.
The question is identical in form, yet the answer differs: in the needle test the model repeats the inserted text, while in AbsenceBench it must recall what was cut even though no token now marks the spot.
The middle bar chart measures how five leading language models handle both tasks. Their scores stay high when looking for an inserted line but fall sharply when asked to list deletions, proving that omissions are much harder to detect.
The panel on the right shows that the benchmark still deals with large contexts; it simply shifts focus from spotting a stray piece of straw to noticing the straw that vanished.
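The setup is easy to reproduce in miniature; here is a hypothetical sketch of how such an absence test can be built and scored with line-level F1 (the prompting and scoring details in the actual benchmark may differ):

```python
import random

def make_absence_test(lines, n_deletions=2, seed=0):
    """Delete random lines from a document; the gold answer is what's gone."""
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(lines)), n_deletions))
    edited = [line for i, line in enumerate(lines) if i not in removed]
    gold = [lines[i] for i in sorted(removed)]
    return edited, gold

def f1(predicted, gold):
    """Line-level F1 between the model's claimed deletions and the truth."""
    tp = len(set(predicted) & set(gold))
    if tp == 0:
        return 0.0
    precision = tp / len(set(predicted))
    recall = tp / len(set(gold))
    return 2 * precision * recall / (precision + recall)

poem = [f"line {i} of the poem" for i in range(20)]
edited, gold = make_absence_test(poem)
# The model would see both `poem` and `edited`, then list the missing
# lines; here we score a perfect answer and an empty one.
print(f1(gold, gold), f1([], gold))  # 1.0 0.0
```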