The most insane thing to me:
The whole training cost only $5.576 million, or ~55 days on a 2048xH800 cluster. This is TINY compared to the Llama, GPT or Claude training runs.
- 671B MoE with 37B activated params
- DeepSeek MoE architecture: 1 shared expert and 256 routed experts, 8 active routed experts for each token
- Multi-head Latent Attention (low-rank joint compression for attention keys and values)
- Multi-token prediction (useful for speculative decoding and better usage of the training data) - for D additional predicted tokens there are D additional sequential modules
- some ablation study results for MTP:
- auxiliary-loss-free load-balancing to prevent MoE collapse (see the sketch after this list); ablation study below:
- 14.8T training tokens
- BPE tokenizer 128k vocab
- only 61 layers :(
- 2.788M H800 training hours with FP8 mixed precision
- pre-training --> two-stage context length expansion, first to 32k tokens and then to 128k tokens
--> post-training uses SFT and RL to align with human preferences and to distill R1 reasoning capabilities
- a bunch of interesting stuff on the infrastructure, and how they got the FP8 training to work (I don't really care about that), but worth reading if you are into that
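To make the load-balancing trick concrete, here is a tiny numpy sketch of how I understand it (my own reconstruction, not DeepSeek's code; the sizes and the update rate gamma are toy values I made up):

```python
import numpy as np

# Toy sketch of auxiliary-loss-free load balancing: a per-expert bias
# steers routing instead of an auxiliary loss term.
n_experts, top_k, gamma = 16, 4, 0.001
bias = np.zeros(n_experts)  # per-expert routing bias, updated online

def route(scores):
    """scores: (n_tokens, n_experts) nonnegative token-expert affinities."""
    # The bias is added ONLY when selecting the top-k experts ...
    topk = np.argsort(scores + bias, axis=1)[:, -top_k:]
    # ... while the gating weights still use the unbiased scores, so the
    # bias steers load without distorting the output mixture.
    gates = np.take_along_axis(scores, topk, axis=1)
    gates /= gates.sum(axis=1, keepdims=True)
    return topk, gates

def update_bias(topk):
    """After each batch, nudge bias down for overloaded experts and up
    for underloaded ones - balancing load without an auxiliary loss."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    bias[load > load.mean()] -= gamma
    bias[load <= load.mean()] += gamma

topk, gates = route(np.random.rand(8, n_experts))  # 8 fake tokens
update_bias(topk)
```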
Not serious results - I have to rerun these, although the 397B results looked fine:
# 21 Qwen3.5 397B A17B :)
# 40 Qwen3.5 35B A3B :/ (few buggy responses)
# 44 Qwen3.5 122B A10B :( (lots of buggy responses)
(all with thinking, and tested with official providers. checked all the responses for errors!)
- Step 3.5 with a very impressive score. It uses an insane amount of tokens, but the validity ratio and the score honestly speak for themselves. It makes use of these tokens. Similar to DeepSeek-V3.2 Speciale. Look at the far right of the plot.
- Kimi-K2.5 has basically the same score as K2.
I can't say exactly how much more efficient it is, because I couldn't get K2 Thinking to run again. But I remember K2 taking around 40k tokens per response. So we might be talking about 2x more efficient reasoning!
- Seed 2.0 Pro was impressive especially in terms of reasoning efficiency
- Qwen3.5 397B A17B outperforms GLM-5 and MiniMax M2.5, which is good, but it also uses more tokens.
- Qwen3.5 35B A3B is okay, still below GPT-OSS-20B. Had a couple of responses where it included the reasoning, which makes the score 0 for that word.
- Qwen3.5 122B A10B was bugged. It was returning the reasoning for like half of all responses. Should score much higher without buggy responses.
- GLM-5 was somewhat disappointing. I thought it would be on the level of Kimi-K2.5. But I checked all responses manually: no faulty responses like with the smaller Qwen3.5 models.
- MiniMax M2.5's score is fine, but nothing crazy. GPT-OSS-120B kinda mogs it. MiniMax also had issues with the output format: it responded with the list of words, but after like 100 words it would go back to reasoning or change the format.
- Seed 1.8, 2.0 Lite and Mini all perform well for their price and are all super token efficient.
Usually all models should start their raw response with the starting word that was given to them. So everything below 97% on this chart is kinda sus.
So Qwen3.5 122B A10B was clearly bugged and it seems like Llama3.1 8B was too (not gonna rerun llama tho)
I had to run Kimi-K2.5 twice, because last time I tested LisanBench with Qwen3 I had to add a /nothink tag, which I then forgot to remove.
It didn't really change much: scores of 1893 vs 1849. This is also a nice demonstration of how consistent the scores are.
Coding and Mathematics AGI
- METR 50% time horizons above 24 hours - my mean estimate is 30.8 hours; 2-day time horizons possible within frontier labs when accounting for the 60-day lag
- if 2025 was the year of agents, then 2026 will be the year of multi-agent systems
- agents delegating work to subagents -> the start of the agent economy and the great unhobbling!
Most of our current math and coding benchmarks will get saturated!
- Epoch Capabilities Index ( > 175 )
- FrontierMath Levels 1-3 ( > 95% )
- ARC-AGI 1 and 2 ( > 95% )
- SimpleQA verified ( > 95% )
- Simple-Bench ( > 90% )
- SWE-Bench-verified ( > 90% )
- Terminal-Bench 2 ( > 90% )
- WeirdML v2 ( > 85% )
- Humanity's Last Exam ( > 80% )
- FrontierMath Level 4 ( > 75% )
- Cybench ( > 70% )
- GDPval ( > 70% win rate, no ties)
- GSO ( > 65% )
- ARC-AGI-3 ( > 60% and > 80% if they go for o3-preview comparable compute budgets or continual learning breakthrough happens)
- more evals like GDPval that capture the economic value of models and systems
- big focus on white-collar work and a large acceleration of science: specifically I see acceleration in medicine, biology, chemistry, finance, legal and administrative work
- automation of white collar work will be enabled by having reliable and fast computer use agents
- reliable computer use agents will also have implications for how you use the internet. this is OpenAI's big goal: become the hub to the internet and delegate shopping and whatever to agents!
Big models launches to get hyped for in 2026:
- Claude 5 - Claude 5.5
- Gemini 3.5 - Gemini-4
- GPT-5.3 - GPT-6
(everything in between possible, but Gemini 4 ~ 80%, Claude 5.5 ~ 70%, GPT-6 ~ 60% likely before 2027)
- DeepSeek-V4
- Grok-5
- Qwen-4
- Kimi-K3, GLM-5, MiniMax M3
- more Korean models and a bunch of American open-source models :)
The gap between closed and open labs will narrow in H1 2026 due to DeepSeek-V4, then widen in the latter half of the year, especially on economically valuable tasks.
Closed models will be much more reliable. But we will still have Opus 4.5+ level open models by the end of 2026.
Most frontier models will be around 5-10T params. If we see GPT-6 and Gemini-4 at the end of 2026, 10T+ param models are possible. These models + harnesses will be the first agents that aren't just research previews. We should also see much better live models with voice and video modes.
Model architecture:
- we will see both more efficient and more expressive architectures!
- hybrid architectures for even longer context windows, diffusion models for speed on edge devices, but also models that double down on full attention or even more expressive attention mechanisms
- looped language models, other recurrent architectures and continual learning will enable much smaller reasoning models! (TRM on ARC-AGI has paved the way for the reasoning core)
- big improvements in reasoning efficiency
In my 2025 predictions I included a prediction for 2026 that I stand by:
- "someone (Anthropic) figures out efficient test-time-training [...], this will be the next paradigm for 2026 and lead to superintelligence"
General outlook and some random thoughts:
- it will be clear to everybody that Anthropic has the mandate and is ahead of everyone else
- OpenAI, Anthropic and Google will remain frontier labs
- decent chance that Anthropic overtakes OpenAI's valuation and both are valued > 1T
- DeepSeek will join them with V4 as THE Chinese frontier lab
- xAI will likely repeat the Grok-4 story: Grok-5 will be great on benchmarks, but Elon persists in slop-maxxing the model
- AI generated video content will take off with Veo-4 and Sora-3, consistent minute long videos will be possible
- embodied intelligence will start to take off by RL through world models
- full self-driving solved, Waymo and Tesla everywhere
- the stock market will have a 20%+ drawdown
- 15% chance of OpenAI going bankrupt and getting acquired by Microsoft due to a collapse of Oracle or a market crash, caused by a rapidly deteriorating economic situation (unemployment, inflation)
- pushback against AI will become a common theme in most advanced western economies as unemployment rises
- populist right-wing parties continue to gain traction in Europe
- trump/republicans will lose midterm elections
On multi-agent systems
- they are great because they allow systems to work in parallel
- the main agent can avoid keeping every detail and reasoning step in context, which avoids context rot
- the agent economy is an economy where AI agents pay other AI agents to do work
- I think agent harnesses/scaffolding will take off too, like Claude Code but for general purpose
On continual learning:
- my original prediction included that it is achieved by using mech-interp, but the most likely candidates are:
- brute-force through smarter use of .md files, something like skills, basically self-meta-prompting: LLMs writing and optimizing instructions for themselves (see the sketch after this list)
- nested learning architectures like HOPE
- targeted training via mech-interp, LoRA on steroids
We are approaching continual learning no matter what.
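A minimal sketch of what that self-meta-prompting loop could look like (purely illustrative; `llm` and `SKILLS.md` are hypothetical stand-ins, not a real API):

```python
# Illustrative sketch of "skills" self-meta-prompting: the model rewrites
# its own instruction file after each task, so learning happens in the
# prompt rather than in the weights.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion call here")

def solve_and_learn(task: str, skills_path: str = "SKILLS.md") -> str:
    skills = open(skills_path).read()
    answer = llm(f"{skills}\n\nTask: {task}")
    # The learning step: fold whatever worked back into the instructions,
    # so the next call starts from a better prompt - no weight updates.
    new_skills = llm(
        f"Current instructions:\n{skills}\n\n"
        f"Task just handled:\n{task}\n\nYour answer:\n{answer}\n\n"
        "Rewrite the instructions so similar tasks go better next time."
    )
    open(skills_path, "w").write(new_skills)
    return answer
```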
Release cycle times are shrinking every year as FLOPS become more abundant.
In 2023 it was roughly every 6 months, 2024 every 3 months, now frontier labs like OpenAI, Google, Anthropic are approaching model releases monthly.
In the limit this will also lead to continual learning, even without any fancy new tech.
A few more observations after replicating the Tower of Hanoi game with their exact prompts:
- You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff.
- Furthermore, the output limit is 128k tokens for Sonnet 3.7, 64k for DeepSeek R1, and 100k for o3-mini. This includes the reasoning tokens they use before outputting their final answer!
- all models will have 0 accuracy with more than 13 disks simply because they can not output that much!
- the max solvable sizes WITHOUT ANY ROOM FOR REASONING (floor(log2(output_limit / 10))):
DeepSeek: 12 disks
Sonnet 3.7 and o3-mini: 13 disks
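The arithmetic behind those numbers, assuming the ~10 output tokens per move from above:

```python
from math import floor, log2

# Minimal moves for N disks is 2^N - 1; at ~10 output tokens per move,
# the largest N that fits in an output budget is floor(log2(budget / 10)).
for model, budget in [("DeepSeek R1", 64_000), ("Sonnet 3.7", 128_000), ("o3-mini", 100_000)]:
    n = floor(log2(budget / 10))
    print(f"{model}: {n} disks ({2**n - 1} moves, ~{(2**n - 1) * 10:,} tokens)")
# DeepSeek R1: 12 disks (4095 moves, ~40,950 tokens)
# Sonnet 3.7: 13 disks (8191 moves, ~81,910 tokens)
# o3-mini: 13 disks (8191 moves, ~81,910 tokens)
```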
- If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large:
"Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"
- At least Sonnet doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem is and the algorithm to solve it, then output its solution without even thinking about individual steps.
- it's also interesting to look at the models as having an X% chance of picking the correct token at each move
- even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size
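Putting numbers on that with the 99.99% figure: if each of the 2^N - 1 moves is independently correct with probability p, the chance of a flawless N-disk solution is p^(2^N - 1):

```python
# Chance of a flawless N-disk solution if each of the 2^N - 1 moves is
# independently correct with probability p = 0.9999 (hypothetical value).
for n in (10, 13, 15):
    moves = 2**n - 1
    print(n, f"{0.9999**moves:.1%}")
# 10 -> 90.3%   13 -> 44.1%   15 -> 3.8%
```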
Decomposing helps the model focus more on reasoning, since it keeps the problem size smaller, but it will basically get lost in the algorithm and repeat steps.
It needs the history to pick up where it left off, although Tower of Hanoi is in theory stateless: after each move, the optimal next move only depends on the current state.
This was with Gemini 2.0 Flash, so not a reasoning model. But as you can see, decomposing the full problem (which should be solvable in 2^n - 1 steps) into chunks of 5 moves at a time worsens performance.
But I also observed this peak in token usage across the models I tested at around 9-11 disks.
That's simply the threshold where the models say: "Fuck off I'm not writing down 2^n_disks - 1 steps"
LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward-planning, constraint adherence, memory and attention, and long context reasoning and "stamina".
"I see possible futures, all at once. Our enemies are all around us, and in so many futures they prevail. But I do see a way, there is a narrow way through." - Paul Atreides
How it works:
Models are given a starting English word and must generate the longest possible sequence of valid English words. Each subsequent word in the chain must:
- Differ from the previous word by exactly one letter (Levenshtein distance = 1)
- Be a valid English word
- Not repeat any previously used word
The benchmark repeats this process across multiple starting words of varying difficulty. A model's final score is the cumulative length of its longest valid chains from the starting words.
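A minimal sketch of how such a chain can be verified (my reconstruction, not the actual harness; it assumes the words_alpha.txt dictionary described under Verification below):

```python
# LisanBench-style chain verifier: dictionary membership, no repeats,
# and Levenshtein distance exactly 1 between consecutive words.

def is_one_edit(a: str, b: str) -> bool:
    """True iff the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):  # exactly one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):  # ensure a is the shorter word
        a, b = b, a
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        else:  # skip the extra letter in the longer word
            edits += 1
            if edits > 1:
                return False
            j += 1
    return True  # a trailing extra letter counts as the single edit

def valid_chain_length(chain: list[str], dictionary: set[str]) -> int:
    """Score a response as the length of its longest valid prefix."""
    seen = set()
    for k, word in enumerate(chain):
        ok = word in dictionary and word not in seen
        if k > 0:
            ok = ok and is_one_edit(chain[k - 1], word)
        if not ok:
            return k
        seen.add(word)
    return len(chain)

dictionary = set(open("words_alpha.txt").read().split())
print(valid_chain_length(["cat", "cot", "cog", "dog"], dictionary))  # 4
```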
Results:
- o3 is by far the best model, mainly because it is the only model that manages to escape from parts of the graph with very low connectivity and many dead-ends
(slight caveat: o3 was by far the most expensive one to run and used ~30-40k reasoning tokens per starting word)
- Opus 4 and Sonnet 4 with 16k reasoning tokens also perform extremely well, especially Opus, which was able to beat o3 on 3 starting words with only one third of the reasoning tokens!
- Claude 3.7 with thinking taking 4th place ahead of o1
- the other OpenAI reasoning models all perform well, but size does make a difference! o1 is ahead of o4-mini high and o3-mini
- Gemini models perform a bit worse than their Anthropic and OpenAI counterparts, but they have by far the longest outputs - they are a bit delusional and keep yapping; they don't realize when they've made a mistake and stop
- strongest non-reasoning models: Grok-3, GPT-4.5, Sonnet 3.5 and 3.7, Opus 4, Sonnet 4, DeepSeek-V3 and Gemini 1.5 Pro
- Grok 3, Sonnet 3.5 and 3.7 are a surprise!!
Inspiration:
LisanBench draws from benchmarks like AidanBench and SOLO-Bench. However, unlike AidanBench, it's extremely cost-effective, trivially verifiable and doesn't rely on an embedding model - the entire benchmark cost only ~$50 for 57 models.
And unlike SOLO-Bench, it explicitly tests knowledge and applies stronger constraints, which makes it more challenging!
Verification:
Verification uses the words_alpha.txt dictionary from github.com/dwyl/english-w… (370,105 words), but for scalability, only words from the largest connected component (108,448 words) are used.
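One way to reconstruct that component (a sketch, not the actual script; the wildcard-bucket trick avoids comparing all ~370k words pairwise, and the exact count may differ slightly with implementation details):

```python
from collections import defaultdict

words = set(open("words_alpha.txt").read().split())
adj = defaultdict(set)

# Substitution neighbors: words sharing a wildcard pattern like "c?t"
# differ in exactly one letter.
buckets = defaultdict(list)
for w in words:
    for i in range(len(w)):
        buckets[w[:i] + "?" + w[i + 1:]].append(w)
for group in buckets.values():
    for a in group:
        for b in group:
            if a != b:
                adj[a].add(b)

# Insertion/deletion neighbors: deleting one letter yields another word.
for w in words:
    for i in range(len(w)):
        shorter = w[:i] + w[i + 1:]
        if shorter in words:
            adj[w].add(shorter)
            adj[shorter].add(w)

# Iterative DFS over the word graph to find the largest component.
seen, largest = set(), set()
for start in words:
    if start in seen:
        continue
    comp, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u in comp:
            continue
        comp.add(u)
        stack.extend(adj[u] - comp)
    seen |= comp
    if len(comp) > len(largest):
        largest = comp

print(len(largest))  # should land near the 108,448 quoted above
```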
Easy Scaling, Difficulty Adjustment & Accuracy improvements:
- Scaling and Accuracy: Just add more starting words or increase the number of trials per word.
- Difficulty: Starting words vary widely - from those with 72 neighbors to those with just 1 - effectively distinguishing between moderately strong and elite models. Difficulty can also be gauged via local connectivity and branching factor.
Why is it challenging?
LisanBench uniquely stresses:
- Forward planning: avoiding dead ends by strategic word choices - models must find the narrow way through
- Knowledge: wide vocabulary is essential
- Memory and Attention: previously used words must not be repeated
- Precision: strict adherence to Levenshtein constraints
- Long-context reasoning: coherence and constraint-tracking over hundreds of steps
- Output stamina: some models break early during long generations — LisanBench exposes that, which is critical for agentic use cases
The two beautiful plots below show that the starting words are very different in difficulty. Some are in low connectivity regions, some in high-connectivity regions and others are just surrounded by dead-ends!
Just as Paul Atreides had to navigate the political, cultural, and metaphysical maze of his destiny, LLMs in LisanBench must explore vast word graphs, searching for the Golden Path - the longest viable chain without collapse.
We will know the chosen model when it appears.
It will be the one that finds the Golden Path and avoids every dead end. Right now, for the most difficult starting word "abysmal", the longest chain found is just 2, although it is also part of the >100k connected component. So there is a narrow way through!
More plots with full leaderboard below!
Full Leaderboard:
It is worse than AidanBench in one regard: because it operates on a word/character level and not on a sentence/paragraph level, it is affected by tokenization! So models with better tokenizers should, all else being equal, perform better.
And I only tested 10 starting words; doing 25 or 50 for good measure would probably help with stability.