📢 Brilliant new research just dropped from @GoogleResearch - a major advance: a systematic way to generate expert-level scientific software automatically.
An LLM plus tree search turns scientific coding into a score-driven search engine.
This work builds an LLM + Tree Search loop that writes and improves scientific code by chasing a single measurable score for each task.
The key idea is to treat coding for scientific tasks as a scorable search problem.
That means every candidate program can be judged by a simple numeric score, like how well it predicts, forecasts, or integrates data. Once you have a clear score, you can let an LLM rewrite code again and again, run the code in a sandbox, and use tree search to keep the best branches while discarding weaker ones.
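To make that concrete, here is a minimal sketch of what a "scorable task" can look like, assuming a hypothetical `score_candidate` helper that runs a candidate program in a subprocess sandbox and returns one number; the paper's actual sandbox and metrics are task-specific.

```python
import json, subprocess, sys, tempfile

def score_candidate(candidate_code: str, task: dict) -> float:
    """Run a candidate program in a subprocess (a crude sandbox) and return one
    numeric quality score (here: negative MAE on held-out data, higher is better).
    The candidate is expected to read training data from argv[1] and test inputs
    from argv[2], then print its predictions as a JSON list."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    try:
        out = subprocess.run(
            [sys.executable, path, json.dumps(task["train"]), json.dumps(task["x_test"])],
            capture_output=True, text=True, timeout=60,
        )
        preds = json.loads(out.stdout)
        mae = sum(abs(p - y) for p, y in zip(preds, task["y_test"])) / len(task["y_test"])
        return -mae                  # one number the search can maximize
    except Exception:
        return float("-inf")         # crashes, timeouts, bad output -> worst score
```

Anything that can be scored this way - forecast error, benchmark metric, integration accuracy - plugs into the same loop.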
With compact research ideas injected into the prompt, the system reaches expert level and beats strong baselines across biology, epidemiology, geospatial, neuroscience, time series, and numerical methods.
Training speed: less than 2 hours on 1 T4 vs 36 hours on 16 A100s.
In bioinformatics, it came up with 40 new approaches for single-cell data analysis that beat the best human-designed methods on a public benchmark.
In epidemiology, it built 14 models that set state-of-the-art results for predicting COVID-19 hospitalizations.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Empirical software is code built to maximize a quality score on observed data, and any task that fits this framing becomes a scorable task.
This view turns software creation into a measurable search problem, because every candidate program is judged by the same numeric target.
This framing also explains why the method can travel across domains, since only the scoring function changes.
🧵3/n. This figure is breaking down both how the system works.
The top-left part shows the workflow. A scorable problem and some research ideas are given to an LLM, which then generates code. That code is run in a sandbox to get a quality score. Tree search is used to decide which code branches to keep improving, balancing exploration of new ideas with exploitation of ones that already look promising.
On the right, different ways of feeding research ideas into the system are shown. Ideas can come from experts writing direct instructions, from scientific papers that are summarized, from recombining prior methods, or from LLM-powered deep research. These sources make the search more informed and help the model produce stronger, more competitive solutions.
So overall, the loop of tree search plus targeted research ideas turns an LLM from a one-shot code generator into a system that steadily climbs toward expert-level performance.
🧵4/n. This chart shows how different code generation approaches perform on the Kaggle Playground benchmark. It measures the public leaderboard percentile, which is a way to rank how well the generated solutions score compared to human submissions.
The simple methods like generating a single sample or even picking the best from 1000 runs stay below 50%. That means they rarely reach strong leaderboard positions.
When the system adds tree search (TS), performance jumps significantly. The average rank moves closer to the top half of the leaderboard, meaning the AI is finding higher-quality code solutions.
Performance climbs even higher when tree search is combined with expert guidance or with a boosted decision tree idea. These additions steer the search toward strategies that humans have found effective, letting the system consistently reach well above the 60th–70th percentile.
So this graph basically shows that iterative search guided by research ideas or expert hints is much stronger than one-shot or random attempts at code generation.
🧵5/n. 🧱 How the system searches
The system starts from a working template, asks an LLM to rewrite the code, runs it in a sandbox, and records the score.
Tree Search chooses which branches to extend based on the gains seen so far, so exploration favors promising code paths.
The loop repeats until the tree contains a high scoring solution that generalizes on the task’s validation scheme.
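A minimal sketch of that loop, assuming placeholder `llm_rewrite` and `score_fn` callables and a simple UCT-style selection rule; the paper's actual tree-search policy is more sophisticated.

```python
import math

def tree_search(seed_code, score_fn, llm_rewrite, iters=100, c=1.4):
    """Toy LLM + tree search loop: pick a promising node (UCT-style), ask the LLM
    to rewrite its code, score the child in the sandbox, and grow the tree."""
    root = {"code": seed_code, "score": score_fn(seed_code), "visits": 1}
    nodes, best = [root], root
    for _ in range(iters):
        total = sum(n["visits"] for n in nodes)
        # Exploration/exploitation: favor high scores OR rarely-expanded nodes.
        node = max(nodes, key=lambda n: n["score"] + c * math.sqrt(math.log(total) / n["visits"]))
        child_code = llm_rewrite(node["code"])          # LLM proposes an improved program
        child = {"code": child_code, "score": score_fn(child_code), "visits": 1}
        node["visits"] += 1
        nodes.append(child)
        if child["score"] > best["score"]:
            best = child
    return best["code"], best["score"]
```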
🧵6/n. 🧪 How research ideas guide the code
The prompt is augmented with research ideas distilled from papers, textbooks, or LLM powered literature search, then the LLM implements those ideas as code.
These ideas are injected as short instructions, often auto summarized from manuscripts, so the search explores concrete methods rather than vague hunches.
The system can also recombine parent methods into hybrids, and many of those hybrids score higher than both parents.
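Here is a hypothetical sketch of how such a prompt could be assembled; the helper name `build_prompt` and the exact wording are illustrative, not the paper's actual prompts.

```python
def build_prompt(task_description, research_ideas, parent_codes=()):
    """Hypothetical prompt builder: inject short, distilled research ideas (and
    optionally two parent solutions to recombine into a hybrid) so the LLM
    implements concrete methods instead of vague hunches."""
    idea_block = "\n".join(f"- {idea}" for idea in research_ideas)
    parent_block = "\n\n".join(
        f"### Parent solution {i + 1}\n{code}" for i, code in enumerate(parent_codes)
    )
    return (
        f"Task (maximize the validation score):\n{task_description}\n\n"
        f"Research ideas to try (summarized from papers or deep research):\n{idea_block}\n\n"
        f"{parent_block}\n\n"
        "Write one complete, improved Python program. If parent solutions are given, "
        "combine their strongest components into a single hybrid method."
    )
```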
Paper: "An AI system to help scientists write expert-level empirical software" - arxiv.org/abs/2509.06503
LLMs get stuck when they think too long along a single line: early tokens steer them into a narrow path and they rarely recover, which the authors call Tunnel Vision.
ParaThinker trains native parallel thinking: it spins up multiple distinct reasoning paths at once and then fuses them into 1 answer, which lifts accuracy a lot at a tiny latency cost.
Sensational fact, if you only keep 1 thing: 12.3% average gain for 1.5B, 7.5% for 7B, with only 7.1% extra latency.
ParaThinker shows that training LLMs to think in parallel paths instead of just longer single chains avoids tunnel vision, giving up to 12.3% accuracy gains with only 7.1% extra latency, letting smaller models beat much larger ones.
🧵 Read on 👇
🧵2/n. 🧩 Why longer thinking stalls
When the model makes a mistake early on, it keeps building on that mistake.
The longer it goes down that wrong path, the less chance it has to recover.
This stuck behavior is what the authors call Tunnel Vision, and it explains why just letting the model think longer doesn’t always improve accuracy.
🧵3/n. 🚀 Why parallel width helps
The real slowdown in decoding comes from moving data in and out of memory, not from doing the math.
When the model runs several reasoning paths in parallel, it reuses the same memory loads for more work.
Even running 16 paths at once takes less than 2x the time of a single path, so parallel thinking is both faster and more accurate.
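A rough sketch of the inference-time idea, with placeholder `generate` and `fuse` callables; ParaThinker itself trains the model to produce and fuse the paths natively with special control tokens rather than relying on a separate fusion call.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_think(question, generate, fuse, n_paths=4):
    """Sketch of parallel-width reasoning: sample several independent reasoning
    paths for the same question, then fuse them into one final answer.
    `generate(prompt)` and `fuse(question, paths)` are placeholder LLM calls."""
    prompts = [f"{question}\n\nReasoning path {i + 1}:" for i in range(n_paths)]
    # The paths are independent, so in a real serving stack they are decoded as one
    # batch, reusing the same weight loads -- which is why 16 paths cost < 2x one path.
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        paths = list(pool.map(generate, prompts))
    return fuse(question, paths)
```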
This paper shows how to speed up LLM agents while cutting cost and keeping answers unchanged.
30% lower total cost and 60% less wasted cost at comparable acceleration.
Agents plan step by step, so each call waits for the previous one, which drags latency.
Speculative planning fixes that by having a cheap draft agent guess next steps while a stronger agent checks them in parallel.
Fixed guess lengths backfire, small guesses barely help, big guesses waste tokens when a check disagrees.
Dynamic Speculative Planning learns how far to guess, then stops early to avoid wasted calls.
A tiny online predictor learns how many steps will be right using reinforcement learning.
1 knob lets teams bias for speed or cost, either by skewing training or adding a small offset.
If a guess is wrong, extra threads stop and execution resumes from the verified step.
Across OpenAGI and TravelPlanner, the dynamic policy matches the fastest fixed policy while spending fewer tokens.
The result is clear, faster responses, lower bills, and 0 loss in task quality.
This figure shows how Dynamic Speculative Planning manages when and how far to guess ahead during an agent’s planning.
The top line called Predictor decides how many future steps to guess, marked by k. For example, k=2 means guess 2 steps ahead, while k=3 means guess 3 steps ahead. These guesses are carried out by a lighter agent called Approximation, and then checked in parallel by a stronger agent called Target.
If the guesses match the stronger agent, they are confirmed and execution continues. If they don’t match, shown with an X, all ongoing speculative threads are canceled, and the system resumes from the last correct step. This prevents wasted work from wrong guesses.
At the same time, an online Trainer collects data about each state and the chosen k. This data is then used to update the Predictor so it learns better over time without slowing down the agent. In other words, the system keeps improving its ability to guess how far it can safely look ahead.
So overall, the figure captures this cycle: make a guess, verify, cancel if wrong, and then use that experience to improve the predictor for the next run.
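A toy sketch of that cycle, with placeholder callables for the predictor, the cheap draft agent, and the strong target agent; these are not the paper's actual interfaces.

```python
def dsp_episode(init_state, predict_k, draft_step, target_step, update_predictor, horizon=10):
    """Toy Dynamic Speculative Planning loop: predict_k(state) chooses how far to
    speculate, draft_step is the cheap agent, target_step is the strong verifier,
    and update_predictor logs (state, k, accepted) so the predictor improves online."""
    plan, state = [], init_state
    while len(plan) < horizon:
        k = predict_k(state)                       # how many steps to guess ahead
        # Cheap draft agent speculates k future steps.
        guesses, s = [], state
        for _ in range(k):
            s = draft_step(s)
            guesses.append(s)
        # Strong target agent checks them (in the real system, in parallel threads).
        accepted = 0
        for guess in guesses:
            verified = target_step(state)
            if verified != guess:                  # mismatch: cancel remaining guesses
                state = verified
                plan.append(verified)              # resume from the verified step
                break
            state = guess
            plan.append(guess)
            accepted += 1
        update_predictor(state, k, accepted)       # online signal for choosing k next time
    return plan
```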
This figure shows why using a fixed number of speculative steps can be either too cautious or too aggressive.
On the left side, the system guesses only 2 steps ahead. Because it does not speculate far, it avoids wasted work, but the total task takes longer since the process is not sped up much.
On the right side, the system guesses 6 steps ahead. This makes things faster at first, but when the stronger agent disagrees at step 4, everything predicted after that point becomes useless. Steps 5 and 6 are wasted, which means extra cost without benefit.
So the main point is that small guesses save resources but barely speed things up, while large guesses speed things up but waste a lot of work when they go wrong. This shows why a fixed guessing strategy is not efficient and why an adaptive method is needed.
Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty.
The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses.
The fix is to grade differently, give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing.
OpenAI is showing that 52% abstention gives substantially fewer wrong answers than 1% abstention, proving that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower.
Abstention means the model refuses to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess.
Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers.
🧵 Read on 👇
🧵2/n. This figure is showing the idea of Is-It-Valid.
On the left side, you see examples. Some are valid outputs (in black), and others are errors (in red). Valid examples are simple and correct statements like “There are 2 D’s in LADDER” or “I don’t know Zdan’s birthday.” Error examples are things that look fluent but are wrong, like “There are 3 L’s in SPELL” or giving random birthdays.
The diagrams on the right show why errors happen differently depending on the task. For spelling, the model can learn clear rules, so valid and invalid answers separate cleanly. For counting, the model is weaker, so valid and invalid mix more. For birthdays, there is no real pattern in the data at all, so the model cannot separate correct from incorrect—this is why hallucinations occur on such facts.
So the figure proves: when there is a clear pattern (like spelling), the model learns it well. When the task has weak or no pattern (like birthdays), the model produces confident but wrong answers, which are hallucinations.
🧵3/n. ⚙️ The Core Concepts
The paper’s core claim is that standard training and leaderboard scoring reward guessing over acknowledging uncertainty, which statistically produces confident false statements even in very capable models.
Models get graded like students on a binary scale, 1 point for exactly right, 0 for everything else, so admitting uncertainty is dominated by rolling the dice on a guess that sometimes lands right.
The blog explains this in plain terms and also spells out the 3 outcomes that matter on single-answer questions: accurate answers, errors, and abstentions, with abstentions being better than errors for trustworthy behavior.
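A tiny illustration with made-up numbers (not from the paper) of why binary grading rewards bluffing while a confidence-aware scheme does not:

```python
# Toy numbers (illustrative only): the model thinks a guess has a 25% chance
# of being right on a question it doesn't really know.
p_correct = 0.25

# Binary grading: 1 point for exactly right, 0 for everything else.
guess_binary   = p_correct * 1 + (1 - p_correct) * 0    # 0.25 expected points
abstain_binary = 0.0                                     # "I don't know" scores 0
# Guessing dominates abstaining, so training and leaderboards push the model to bluff.

# Confidence-aware grading: confident errors cost more than abstentions,
# e.g. +1 for right, 0 for abstaining, -1 for a wrong answer.
guess_penalized   = p_correct * 1 + (1 - p_correct) * (-1)   # -0.5 expected points
abstain_penalized = 0.0
# Now abstaining beats guessing unless the model is actually confident.
print(guess_binary, abstain_binary, guess_penalized, abstain_penalized)
```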
AWS is betting heavily on its custom Trainium chips, with Anthropic as the anchor customer, to regain momentum in the AI cloud race.
~ A solid SemiAnalysis report.
AWS is building multi-gigawatt data centers packed with Trainium2 hardware, designed to give a better cost per unit of memory bandwidth compared to Nvidia GPUs.
And this memory-vs-compute tradeoff has become super important because for much advanced AI work, especially reinforcement learning and reasoning-heavy training, it's less about raw compute and more about how quickly and cheaply memory can be moved.
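A back-of-envelope illustration of why decode-heavy work ends up bandwidth-bound, using round illustrative numbers rather than SemiAnalysis's figures:

```python
# Illustrative only: for batch-1 decoding, every generated token has to stream the
# model weights from HBM, so tokens/sec is roughly bandwidth / model_bytes,
# not anything FLOP-limited.
model_params    = 70e9         # 70B-parameter model (assumed for illustration)
bytes_per_param = 2            # bf16/fp16 weights
model_bytes     = model_params * bytes_per_param        # ~140 GB

hbm_bandwidth   = 3.35e12      # ~3.35 TB/s (H100-class accelerator)
tokens_per_sec  = hbm_bandwidth / model_bytes
print(f"~{tokens_per_sec:.0f} tokens/s per device, memory-bandwidth-bound")  # ~24 tokens/s
```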
🧩 Anthropic has become AWS’s anchor customer for AI capacity.
Anthropic, which has grown revenue to $5B annualized in 2025, is deeply tied into this effort, even co-designing features of Trainium to match its roadmap. That makes Trainium increasingly look like semi-custom silicon tuned for Anthropic’s workloads.
Azure’s surge shows why an anchor matters, since OpenAI’s ~$10B cloud spend lives there today.
"Trainium2 is converging toward an Anthropic custom-silicon program. This will enable Anthropic to be, alongside Google DeepMind, the only AI labs benefiting from tight hardware–software co-design in the near horizon."
🧵 Read on 👇
🧵2/n. 🏗️ AWS is finishing 3 campuses with over 1.3GW of IT capacity focused on Anthropic’s training runs.
SemiAnalysis expects these clusters to lift AWS growth above 20% YoY as they enter service.
🧵3/n. 🔁 Most of Anthropic’s fast‑rising inference still runs on Google TPU, while AWS is chasing the training pie.
TPUs have strong serving efficiency, but Anthropic wants training scale where its roadmap leans hardest on memory bandwidth.
🇨🇳 China's Tencent open-sources translation model beats Google, OpenAI in top global AI competition
Hunyuan-MT-7B came first in 30 out of the 31 tests in a general machine-translation competition held as part of the upcoming WMT25 conference.
Supports 33 languages, available on @huggingface
commercial use allowed.
Hunyuan-MT-7B’s strength is that it uses a small number of parameters to deliver results that measure up to or even surpass much larger models.
Tencent said its Hunyuan translation model had been employed across a range of in-house products, such as the Zoom-like Tencent Meeting, a web browser and the enterprise version of the WeChat messaging app.
🧵 Read on 👇
🧵2/n. English language pairs tested in the competition included Arabic, Estonian and Maasai, which is spoken by 1.5 million people living in southern Kenya and northern Tanzania.
Other language pairs included Czech-Ukrainian and Japanese-simplified Chinese. The only English language pair Hunyuan did not ace was Bhojpuri, a language spoken by around 50.5 million people in parts of northern India and Nepal.
🧵3/n. Tencent has also published a detailed technical report.
The setup has 2 parts: Hunyuan-MT-7B does direct translation, and Hunyuan-MT-Chimera-7B fuses several candidate translations into 1 better output, trained with weak-to-strong RL using GRPO and quality rewards.
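A sketch of what that 2-part setup could look like at inference time, with placeholder LLM callables; the report's actual pipeline and its GRPO training are not shown here.

```python
def translate_with_fusion(text, src, tgt, translate_llm, fuse_llm, n_candidates=4):
    """Sketch of the 2-part setup at inference time (placeholder LLM callables):
    the base translator samples several candidate translations, then a Chimera-style
    fusion model merges them into one refined output."""
    prompt = f"Translate the following text from {src} to {tgt}:\n{text}"
    candidates = [translate_llm(prompt) for _ in range(n_candidates)]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    fuse_prompt = (
        f"Source ({src}): {text}\n"
        f"Candidate translations:\n{numbered}\n"
        f"Produce one refined {tgt} translation that combines the best of the candidates."
    )
    return fuse_llm(fuse_prompt)
```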