On ProofBench-Advanced—where models prove formal mathematical theorems—GPT-5 scores 20%. Gemini Deep Think IMO Gold hits 65.7%. DeepSeek Math V2 (Heavy) scores 61.9%.
That's second place—but Gemini isn't open source.
This is the best open math model in the world. And DeepSeek released the weights. Apache 2.0.
Here's what they discovered:
1/ Why Normal LLMs Break on Real Math
Most large language models are great at sounding smart, but:
- They’re rewarded for the final answer, not the reasoning.
- If they accidentally land on the right number with bad logic, they still get full credit.
- Over time they become “confident liars”: fluent, persuasive, and sometimes wrong.
That’s fatal for real math, where the proof is the product.
To fix this, DeepSeek Math V2 changes what the model gets rewarded for: not just being right, but being rigorously right.
2/ The Core Idea: Generator + Verifier
Instead of one model doing everything, DeepSeek splits the job:
1. Generator – the “mathematician”
- Produces a full, step-by-step proof.
2. Verifier – the “internal auditor”
- Checks the proof for logical soundness.
- Ignores the final answer. It only cares about the reasoning.
This creates an internal feedback loop:
One model proposes, the other critiques.
3/ The Secret Sauce: 1.0/0.5/0.0
The verifier doesn't just say yes or no. It scores on three levels: 1.0, 0.5, and 0.0.
A 0.5 is the referee saying: "You solved it, but this wouldn't pass peer review."
When the generator sees 0.5, it re-reads its own proof, finds the weak steps, tightens the argument.
The model learns to debug its reasoning, not just guess better.
4/ Putnam, IMO, and ProofBench
- Putnam 2024 – ~118/120
- IMO-Gold level performance
- On a “basic” proof dataset, V2 almost perfectly solves the set
- On an “advanced” dataset with long, tricky proofs, it still performs strongly, while many other large models collapse in accuracy
Models without this internal verifier do okay on short, easy proofs…
…and then fall off a cliff on long, complex ones.
DeepSeek’s architecture shows that built-in self-checking is the difference between “good at math questions” and “actually good at proofs.”
5/ How They Trained It
The big risk: if the generator gets smart while the verifier stays weak, the generator learns to game it.
Three-phase solution:
Phase 1 – Human Cold Start. Contest problems graded by expert mathematicians. Anchors the verifier to real standards.
Phase 2 – Meta-Verification. The verifier can start hallucinating errors—seeing problems that don't exist. Solution: a second model checks whether critiques are legitimate or noise.
Phase 3 – Scaled Compute. For the hardest problems, human labeling is too slow. Run many verification passes, use majority vote as training signal.
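The majority-vote step in Phase 3 can be sketched roughly as follows. This is an illustrative sketch, not DeepSeek's training code: `verify_once` is a hypothetical stand-in for one verifier pass, and requiring a strict majority (rather than a plurality) is an assumption.

```python
from collections import Counter
from typing import Callable, Optional


def majority_label(proof: str,
                   verify_once: Callable[[str], float],
                   passes: int = 8) -> Optional[float]:
    """Run several independent verification passes and keep the majority score.

    verify_once is a hypothetical callable returning 1.0, 0.5, or 0.0.
    Returns None when no score wins a strict majority, so ambiguous
    proofs can be dropped instead of becoming noisy training labels.
    """
    votes = Counter(verify_once(proof) for _ in range(passes))
    score, count = votes.most_common(1)[0]
    return score if count > passes / 2 else None
```

Dropping proofs without a clear majority trades dataset size for label quality, which is one plausible way to keep the verifier from training on its own noise.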
Humans set the rules. Compute scales them.
6/ Big Model, Big Hardware
DeepSeek Math V2 is a Mixture-of-Experts (MoE) model with about 685B parameters.
- Only some “experts” are active per token, so each step is cheaper than a dense 685B model
- But all those parameters still have to live in GPU memory
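A back-of-the-envelope sketch of what "live in GPU memory" means for the weights alone. The 685B figure is from the post; the bytes-per-parameter values and the 80 GB accelerator are assumptions for illustration, and the estimate ignores activations and KV cache.

```python
import math

TOTAL_PARAMS = 685e9  # total parameter count quoted in the post


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def min_gpus(params: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: weights only, no activations or KV cache."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_memory_gb)


# Assumed precisions: 1 byte/param (8-bit) and 2 bytes/param (16-bit).
for label, nbytes in [("8-bit", 1), ("16-bit", 2)]:
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    # 80 GB per GPU is a hypothetical accelerator size.
    print(f"{label}: {gb:.0f} GB of weights, at least {min_gpus(TOTAL_PARAMS, nbytes, 80)} GPUs")
```

Even at 8-bit precision the weights do not fit on a single multi-GPU node of that size, which is why sparse activation lowers the cost per step but not the memory bill.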
The weights are open. The bottleneck is compute.
7/ How You Actually Use It: Agent Mode
In practice, you don’t just send one prompt and get a perfect proof.
Instead, you run it in agent mode, something like:
1. Ask it to solve a problem.
2. It generates a proof and a self-verification score.
3. If the score is 0.5, you feed its own critique back in:
- “Refine this proof based on the issues you identified.”
4. Repeat this refinement loop a few times (e.g., up to 8 rounds).
5. Stop when it produces a 1.0 proof or you’re satisfied.
You're managing a feedback loop, not passively waiting for output.
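The loop above can be sketched in a few lines. Everything here is hypothetical scaffolding: `generate` and `verify` stand in for calls to the model, the prompt wording is made up, and the toy functions exist only so the loop runs end to end. As a simplification, the sketch refines on any score below 1.0, not only on 0.5.

```python
from typing import Callable, Tuple

# Hypothetical interfaces for illustration; not a real DeepSeek API.
Generate = Callable[[str], str]                   # prompt -> proof text
Verify = Callable[[str, str], Tuple[float, str]]  # (problem, proof) -> (score, critique)


def refine_until_verified(problem: str, generate: Generate, verify: Verify,
                          max_rounds: int = 8) -> Tuple[str, float, int]:
    """Generate a proof, then feed the critique back until it scores 1.0."""
    proof = generate(f"Prove the following:\n{problem}")
    score, critique = verify(problem, proof)
    rounds = 0
    while score < 1.0 and rounds < max_rounds:
        prompt = (f"Problem:\n{problem}\n\nPrevious proof:\n{proof}\n\n"
                  f"Issues identified:\n{critique}\n\n"
                  "Refine this proof based on the issues you identified.")
        proof = generate(prompt)
        score, critique = verify(problem, proof)
        rounds += 1
    return proof, score, rounds


# Toy stand-ins so the loop can be exercised without a model.
def toy_generate(prompt: str) -> str:
    return "revised proof" if "Refine this proof" in prompt else "first draft"


def toy_verify(problem: str, proof: str) -> Tuple[float, str]:
    if proof == "revised proof":
        return 1.0, ""
    return 0.5, "Step 2 is asserted without justification."


proof, score, rounds = refine_until_verified(
    "Show that the sum of two even integers is even.", toy_generate, toy_verify)
print(proof, score, rounds)  # revised proof 1.0 1
```

The `max_rounds` cap matters in practice: without it, a proof the verifier never accepts would loop forever and burn compute.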
8/ Limitations
Creativity. Great at formal reasoning and polishing proofs. Still struggles with problems needing genuinely novel insight.
Cost. Those record-setting scores rely on many proof attempts and verification runs. Real-world use means cheaper settings, slightly lower performance.
Residual Errors. The verifier is still a neural net. It can be fooled. Error rate is lower, not zero.
This is a big leap toward reliable reasoning—not "perfect AI mathematician."
9/ From Chatbots to Reasoners
DeepSeek Math V2 represents more than just a math milestone.
The pattern here will spread:
- Split generation and verification
- Train on proof quality, not just right answers
- Add self-critique loops and meta-verifiers
This is the template for any domain where being wrong is expensive—code, science, law, anything that needs to survive peer review.
Battery storage is already scaling—159 GW deployed globally, 926 GW projected by 2033.
Renewables needed it first. Now AI needs it too.
Tesla is deploying Megapacks at data centers. China is deploying 30 GW this year, integrating storage directly into AI buildout.
Why? Data centers can’t scale without solving three problems:
- 7-year interconnection queues
- power quality GPUs demand
- backup without diesel permits
Batteries solve all three ↓
Why AI Data Centers Need Batteries
Interconnection is broken. Utility connection takes 7+ years. Batteries bypass it. Skip the queue.
GPUs break traditional power. Training loads swing 90% at 30 Hz. Batteries smooth it in 30 milliseconds.
Diesel doesn’t scale. Permitting is hard. For 20-hour backup, batteries are cost-competitive.
The math: ~1% of data center capex.
The Scale
Global capacity: 159 GW by end-2024. Up 85% from 86 GW in 2023. Projected: 926 GW by 2033.
Cost curve: $115/kWh in 2024, down 84% from $723/kWh in 2013. Still falling.
Economics flipped. Solar plus 4-hour storage runs ~$76/MWh. New gas peakers cost $80-120/MWh.
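The capacity and cost figures above are easy to sanity-check. The sketch below uses only the numbers quoted in this thread; the implied annual growth rate is derived here and assumes smooth compounding from end-2024 to 2033.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction."""
    return (end / start) ** (1 / years) - 1


capacity_growth = pct_change(86, 159)        # GW, 2023 -> 2024
cost_decline = -pct_change(723, 115)         # $/kWh, 2013 -> 2024
implied_cagr = cagr(159, 926, 2033 - 2024)   # GW, 2024 -> 2033 projection

print(f"Capacity growth 2023-2024: {capacity_growth:.1f}%")
print(f"Cost decline 2013-2024: {cost_decline:.1f}%")
print(f"Implied annual growth to 2033: {implied_cagr * 100:.1f}%")
```

The first two results round to the 85% and 84% quoted above. The 2033 projection works out to a little over 20% compound growth per year.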
The universe isn’t just expanding — it’s speeding up
13.8 billion years after the Big Bang, astronomers expected gravity to be gradually slowing cosmic expansion. Instead, when they looked deep into space, they found the opposite: the expansion is speeding up.
Whatever drives that acceleration makes up ~70% of the cosmos.
We call it dark energy.
We can measure it. We can see its effects. So what is it, really?
How we figured this out
Cepheid stars: the distance trick
Henrietta Leavitt discovered that certain stars (Cepheid variables) get brighter and dimmer with a regular period — and that period tells you their true brightness → lets us measure distance to faraway galaxies.
Redshift: galaxies on the move
Vesto Slipher used spectra of galaxies to show many had their light stretched to longer, redder wavelengths.
Redder → moving away faster.
Hubble & the expanding universe
Edwin Hubble and Milton Humason combined Cepheid distances with redshift and found a pattern:
>The farther a galaxy is, the faster it’s receding.
That’s the Hubble–Lemaître law: clear evidence that the universe is expanding.
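Written as a formula, the law is a simple proportionality, where v is a galaxy's recession velocity, d is its distance, and H_0 is the Hubble constant:

```latex
v = H_0 \, d
```

With H_0 measured at roughly 70 km/s per megaparsec, a galaxy 100 megaparsecs away recedes at about 7,000 km/s.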
The shock: expansion is accelerating
In the 1990s, two teams studied Type Ia supernovae, stellar explosions so consistent in brightness that they act like “standard candles.”
By comparing how bright they should be to how bright they look, you can get distance.
By measuring redshift, you get how fast they’re moving away.
The surprise:
• The supernovae were dimmer and farther away than expected.
• That only made sense if, over billions of years, the universe’s expansion had sped up instead of slowing down.
This cosmic acceleration is what we now attribute to dark energy.
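The standard-candle step can be sketched with the distance modulus, the textbook relation between how bright an object appears (m), how bright it truly is (M), and its distance. The formula is standard astronomy rather than something stated in this thread, and it ignores the redshift and dust corrections that real supernova analyses apply.

```python
def luminosity_distance_pc(apparent_mag: float, absolute_mag: float) -> float:
    """Distance in parsecs from the distance modulus m - M = 5*log10(d) - 5."""
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)


# Larger magnitude means dimmer. For the same true brightness,
# a dimmer supernova must be farther away.
near = luminosity_distance_pc(apparent_mag=5.0, absolute_mag=0.0)
far = luminosity_distance_pc(apparent_mag=10.0, absolute_mag=0.0)
print(near, far)
```

This is why "dimmer than expected" translated directly into "farther than expected" for the supernova teams.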
🚨The White House just launched the Genesis Mission — a Manhattan Project for AI
The Department of Energy will build a national AI platform on top of U.S. supercomputers and federal science data, train scientific foundation models, and run AI agents + robotic labs to automate experiments in biotech, critical materials, nuclear fission/fusion, space, quantum, and semiconductors.
Let’s unpack what this order actually builds, and how it could rewire the AI, energy, and science landscape over the next decade:
1/ At the core is a new American Science and Security Platform.
DOE is ordered to turn the national lab system into an integrated stack that provides:
• HPC for large-scale model training, simulation, inference
• Domain foundation models across physics, materials, bio, energy
• AI agents to explore design spaces, evaluate experiments, automate workflows
• Robotic/automated labs + production tools for AI-directed experiments and manufacturing
National-scale AI scientist + AI lab tech as infrastructure.
2/ The targets are very explicit and very strategic.
Within 60 days, DOE has to propose at least 20 “national challenges” in:
Nvidia pulls in nearly $60B per quarter, almost all of it from a handful of hyperscalers who plan their AI roadmaps around Jensen's release cycle.
But three shifts are happening at once:
• Google is committing up to one million TPUs to Anthropic starting 2026 — the first credible alternative at frontier scale.
• Racks are already pushing hundreds of kilowatts, with megawatt systems on the horizon.
• Nvidia has $26B in commitments to rent back its own GPUs from cloud partners — up from $12.6B last quarter.
The real constraint isn't chips anymore — it's power and memory.
Over the next 3–5 years, this creates a fractured landscape: Nvidia GPUs as the default utility, Google TPUs as a real second ecosystem, and hyperscalers racing to escape the Nvidia tax.
Let’s walk through how that actually plays out:
1/ Nvidia now: dominant, concentrated, and structurally exposed
Nvidia's latest quarter (fiscal Q3 2026) is extreme:
• $57B in revenue, +62% YoY
• $51.2B from data center alone
But it’s dangerously concentrated:
• 4 customers = 61% of sales (up from 56% last quarter).
And Nvidia is renting back its own chips:
• $26B in off-balance-sheet commitments to pay hyperscalers for GPUs they can’t fully rent out, up from $12.6B the prior quarter.
That creates a circular-demand loop:
• sell chips to clouds → invest in AI customers → rent those same chips back when there’s slack.
Not a crisis. But a structural dependency that didn’t exist two years ago.
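A quick sketch of what those figures imply, using only the numbers quoted above. It assumes the 61% customer share applies to total quarterly revenue, which the thread does not spell out.

```python
revenue_b = 57.0            # quarterly revenue, $B
data_center_b = 51.2        # data center revenue, $B
top4_share = 0.61           # share of sales from 4 customers
commitments_now_b = 26.0    # rent-back commitments, $B
commitments_prev_b = 12.6   # prior quarter, $B

data_center_share = data_center_b / revenue_b
top4_revenue_b = top4_share * revenue_b  # assumes share is of total revenue
commitment_growth = commitments_now_b / commitments_prev_b

print(f"Data center share of revenue: {data_center_share:.1%}")
print(f"Revenue from top 4 customers: ${top4_revenue_b:.1f}B")
print(f"Rent-back commitments grew {commitment_growth:.2f}x in one quarter")
```

Roughly nine-tenths of revenue comes from one segment, about $35B of it from four buyers, and the rent-back commitments doubled in a single quarter.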
2/ TPUs: no longer just for Google
Google's 7th-gen TPU (Ironwood) is the first built for inference over training.
Why that matters: the bottleneck is shifting. Training a frontier model is a one-time cost. Serving it to billions of users is the recurring expense that actually scales.
The specs reflect this:
• Pods scale to 9,216 accelerators
• 1.77 PB of HBM3E memory per pod
• 9.6 Tb/s optical circuit-switching fabric
That memory pool and interconnect matter more than peak FLOPs. Large inference workloads are memory-bandwidth bound. Ironwood is designed around that reality.
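Dividing the pod-level memory by the accelerator count gives the per-chip figure. This sketch uses only the two pod numbers quoted above and assumes decimal units (1 PB = 10^15 bytes).

```python
POD_ACCELERATORS = 9_216
POD_HBM_BYTES = 1.77e15  # 1.77 PB, assuming decimal petabytes

hbm_per_chip_gb = POD_HBM_BYTES / POD_ACCELERATORS / 1e9
print(f"HBM per accelerator: {hbm_per_chip_gb:.0f} GB")
```

That works out to about 192 GB of HBM per accelerator, all of it addressable across the pod's fabric.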
Google's framing: "The hardest part is now serving AI to billions of users."