🇨🇳 China unveils the world's first brain-like AI model, SpikingBrain 1.0.
Up to 100X faster while being trained on less than 2% of the data typically required.
Designed to mimic how the human brain works, it uses far less energy: a new paradigm in efficiency and hardware independence.
It marks a significant shift from current AI architectures.
Unlike models such as GPT and LLaMA, which use attention mechanisms to process all input in parallel, SpikingBrain1.0 employs localized attention, focusing only on the most relevant recent context.
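To make the contrast concrete, here is a minimal NumPy sketch of full causal attention versus a sliding-window (local) variant that only attends to the most recent tokens. The window size and tensor shapes are illustrative, not SpikingBrain 1.0's actual configuration.

```python
# Minimal sketch: full causal attention vs. sliding-window (local) attention.
# Window size and shapes are illustrative, not SpikingBrain's real configuration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, window=None):
    """window=None: every token attends to all earlier tokens (quadratic cost).
    window=w: each token attends only to the last w tokens (local context)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (T, T) similarity matrix
    mask = np.tril(np.ones((T, T), dtype=bool))      # causal: no peeking at the future
    if window is not None:
        mask &= np.triu(np.ones((T, T), dtype=bool), -window + 1)  # keep only recent tokens
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = rng.normal(size=(3, T, d))
full_out = attention(q, k, v)             # global attention, O(T^2)
local_out = attention(q, k, v, window=3)  # only the 3 most recent tokens per position
print(full_out.shape, local_out.shape)    # (8, 4) (8, 4)
```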
Potential Applications:
- Real-time, low-power environments
- Autonomous drones and edge computing
- Wearable devices requiring efficient processing
- Scenarios where energy consumption is critical
This project is part of the broader scientific pursuit of neuromorphic computing, which aims to replicate the remarkable efficiency of the human brain, an organ that runs on only about 20 watts of power.
---
arxiv.org/abs/2509.05276
🧠 The idea: human-brain-inspired linear and hybrid-linear LLMs, realized in the SpikingBrain architecture.
- SpikingBrain replaces most quadratic attention with linear and local attention, mixes in selective full attention where it matters, and adds an adaptive spiking activation so the model computes only on meaningful events.
- It proves the whole recipe works at scale by training and serving on MetaX C550 GPUs, which are non‑NVIDIA devices, without giving up quality on common benchmarks.
- The headline efficiencies come from 3 levers working together: linear attention for compressed memory, MoE for token-wise sparsity, and spiking for micro-level sparsity.
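As a rough picture of the spiking lever, here is a tiny sketch of a threshold-gated activation: anything below an adaptive threshold becomes exactly zero, so downstream matmuls only spend work on "events". The mean-plus-one-std rule below is a stand-in, not the paper's actual adaptive spiking scheme.

```python
# Illustrative spiking-style activation: zero out everything below an adaptive
# threshold so most entries are exactly 0 (micro-level sparsity). The threshold
# rule (mean + k * std) is a stand-in, not the paper's exact adaptive scheme.
import numpy as np

def spiking_activation(x, k=1.0):
    thresh = x.mean() + k * x.std()          # adaptive, data-dependent threshold
    return np.where(x > thresh, x, 0.0)      # only "spiking" units pass through

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 1024))               # a batch of hidden states
s = spiking_activation(h)
print(f"active units: {np.count_nonzero(s) / s.size:.1%}")  # roughly 15% fire
```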
🛠️ Training without starting from scratch
They do a conversion‑based continual pre‑training, not full pre‑training, by remapping QKV weights from a Transformer checkpoint into linear and local attention, then training for ~150B tokens across 8k, 32k, and 128k contexts.
Because the converted attention maps stay close to the original softmax maps, the model converges quickly and avoids the ~10T-token budgets seen in many from-scratch runs; the ~150B tokens used amount to <2% of typical training data.
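A minimal PyTorch sketch of that conversion step, assuming a standard softmax-attention layer as the source: the pretrained Q/K/V projection weights are copied into a linear-attention module before continual pre-training. The module layout and the elu+1 feature map are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of conversion-based initialization (PyTorch): reuse a pretrained
# softmax-attention layer's Q/K/V projections to seed a linear-attention layer.
# The elu+1 kernel feature map and module layout are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                # x: (batch, seq, d_model)
        phi = lambda t: F.elu(t) + 1                     # positive feature map
        q, k, v = phi(self.q_proj(x)), phi(self.k_proj(x)), self.v_proj(x)
        kv = torch.einsum("bsd,bse->bde", k, v)          # compressed memory, linear in seq
        z = 1.0 / (torch.einsum("bsd,bd->bs", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bsd,bde,bs->bse", q, kv, z)

d_model = 64
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)  # pretrained source
student = LinearAttention(d_model)

# nn.MultiheadAttention stacks Q, K, V in in_proj_weight: copy each slice across.
with torch.no_grad():
    w = teacher.in_proj_weight                           # (3*d_model, d_model)
    student.q_proj.weight.copy_(w[:d_model])
    student.k_proj.weight.copy_(w[d_model:2 * d_model])
    student.v_proj.weight.copy_(w[2 * d_model:])

out = student(torch.randn(2, 16, d_model))               # (2, 16, 64)
# Continual pre-training (~150B tokens in the paper) then adapts the converted layers.
```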
Post‑training then adds instruction following and reasoning in 3 short stages without harming the base capabilities.
AgentGym-RL shows how to train LLM agents to finish long, multi-step tasks by letting them act in real environments with reinforcement learning.
Across 27 tasks, the trained agents rival or beat top proprietary models.
Most agents are trained on single-turn data, so they fail when a job needs many decisions with noisy feedback.
AgentGym-RL splits the system into separate parts, the environments, the agent loop, and training, so each can improve on its own.
It supports mainstream algorithms and realistic tasks, and the agent learns by acting, seeing results, and adjusting across different settings.
The key method, ScalingInter-RL, starts with short interactions to master basics, then slowly allows longer runs so the agent can explore and plan.
This staged horizon schedule stabilizes learning, prevents pointless loops, and encourages planning, reflection, and recovery after mistakes.
A 7B model trained with this setup matches or beats much larger open models and competes well with strong commercial ones.
They also find that putting more compute into training- and test-time interaction, such as more steps or samples, often helps more than adding parameters.
How the AgentGym-RL framework works.
At the center is the LLM agent. It takes an instruction, interacts with an environment for several turns, and then produces actions. Each action changes the environment, and the environment sends feedback back to the agent. This cycle repeats many times.
The environment itself is handled by a server that can simulate different types of tasks. These include web browsing, searching, coding, playing games, doing science tasks, or controlling embodied agents. The environment client manages the interaction and communicates through standard protocols.
Every full cycle of actions and observations is called a trajectory. These trajectories are collected and then used to update the agent’s policy with reinforcement learning algorithms like PPO, GRPO, RLOO, or REINFORCE++.
The framework is modular. The environment, the agent, and the training part are separated. This makes it flexible, easy to extend, and suitable for many types of realistic tasks.
The diagram highlights how the agent learns not by memorizing answers, but by trying actions, getting feedback, and improving its decision making across different domains.
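A toy sketch of that cycle, with a stand-in environment and policy: roll out multi-turn trajectories, then hand the batch to a policy update. The classes and the update call are placeholders, not AgentGym-RL's real API.

```python
# Schematic of the agent-environment training cycle described above. The toy
# environment and policy below stand in for real AgentGym-RL servers and LLM
# agents; only the loop structure mirrors the framework.
import random

class ToyEnv:
    """Stand-in environment: reach a hidden target digit in as few guesses as possible."""
    def reset(self):
        self.target, self.turn = random.randint(0, 9), 0
        return "guess a digit"
    def step(self, action):
        self.turn += 1
        done = (action == self.target) or (self.turn >= 10)
        reward = 1.0 if action == self.target else 0.0
        return f"wrong, turn {self.turn}", reward, done

class ToyPolicy:
    """Stand-in for the LLM agent: acts, then updates from collected trajectories."""
    def act(self, obs):
        return random.randint(0, 9)
    def update(self, trajectories):
        mean_return = sum(sum(r for _, _, r in t) for t in trajectories) / len(trajectories)
        print(f"updated on {len(trajectories)} trajectories, mean return {mean_return:.2f}")

def run_episode(env, policy, max_turns):
    obs, trajectory = env.reset(), []
    for _ in range(max_turns):
        action = policy.act(obs)                    # agent chooses an action
        next_obs, reward, done = env.step(action)   # environment gives feedback
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory

env, policy = ToyEnv(), ToyPolicy()
for _ in range(3):                                  # collect rollouts, then update the policy
    batch = [run_episode(env, policy, max_turns=10) for _ in range(16)]
    policy.update(batch)                            # real framework: PPO / GRPO / RLOO / REINFORCE++
```

Because the environment, the policy, and the update step are separate objects, any one of them can be swapped out independently, which is the modularity point the framework emphasizes.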
The idea behind ScalingInter-RL, the training method used in the paper.
If an agent is trained with only short interactions, it learns to handle easy tasks but fails on harder ones. If it is trained with very long interactions from the start, it wastes effort, falls into repeated mistakes, or even collapses and performs poorly.
ScalingInter-RL solves this by gradually increasing the number of interaction steps during training. At first, the agent works with short horizons to master the basics and build reliable skills.
Then, the horizon is expanded in stages, letting the agent explore more, refine its behavior, and learn how to recover from errors.
By the final stages, the agent can manage long, complex tasks because it has grown its abilities step by step instead of being overloaded too early. This staged process makes training stable and produces stronger agents.
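The staged-horizon idea boils down to a schedule over the maximum number of interaction turns. The stage boundaries and caps below are made up for illustration; feeding the returned cap into the `max_turns` of a rollout loop like the one sketched earlier gives the staged behavior described here.

```python
# Staged interaction-horizon schedule: cap episode length early in training,
# then raise the cap in stages. Boundaries and caps are illustrative only.
def scaling_inter_horizon(step, total_steps, stages=(5, 10, 20, 30)):
    """Return the max number of agent-environment turns allowed at this training step."""
    frac = step / max(total_steps, 1)
    idx = min(int(frac * len(stages)), len(stages) - 1)
    return stages[idx]

total = 1000
for step in (0, 300, 600, 999):
    print(step, scaling_inter_horizon(step, total))
# 0 -> 5 turns, 300 -> 10, 600 -> 20, 999 -> 30: master short tasks first, then extend.
```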
📢 Another brilliant piece of research just dropped from @GoogleResearch - a major advance: a systematic way to generate expert-level scientific software automatically.
An LLM plus tree search turns scientific coding into a score-driven search engine.
This work builds an LLM + Tree Search loop that writes and improves scientific code by chasing a single measurable score for each task.
The key idea is to treat coding for scientific tasks as a scorable search problem.
That means every candidate program can be judged by a simple numeric score, like how well it predicts, forecasts, or integrates data. Once you have a clear score, you can let an LLM rewrite code again and again, run the code in a sandbox, and use tree search to keep the best branches while discarding weaker ones.
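Here is a hedged, self-contained sketch of that loop: a mock "LLM" proposes code rewrites, a sandbox runs and scores each candidate, and a best-first tree search keeps expanding the strongest branches. The toy task (fit a one-line predictor) and the mock rewriter are stand-ins for the real LLM and scientific benchmarks.

```python
# Sketch of the score-driven search loop: propose a code rewrite, run it in a
# sandbox to get a numeric score, and keep expanding the best-scoring branches.
# The "LLM" here is a mock that perturbs a constant; in the real system an LLM
# rewrites whole scientific programs.
import heapq, random

def sandbox_score(code):
    """Run candidate code in an isolated namespace and score its prediction."""
    ns = {}
    try:
        exec(code, {}, ns)                       # 'sandbox' = isolated namespace
        pred = ns["predict"](3.0)
    except Exception:
        return float("-inf")
    target = 9.0                                 # ground truth for input 3.0
    return -abs(pred - target)                   # higher is better

def mock_llm_rewrite(code):
    """Stand-in for an LLM edit: tweak the coefficient in the candidate program."""
    coef = float(code.split("*")[0].split("return ")[1]) + random.uniform(-1, 1)
    return f"def predict(x):\n    return {coef:.3f} * x"

root = "def predict(x):\n    return 1.0 * x"
frontier = [(-sandbox_score(root), root)]        # max-heap via negated scores
best_score, best = sandbox_score(root), root
for _ in range(200):                             # search budget
    neg_score, code = heapq.heappop(frontier)    # expand the most promising node
    for _ in range(3):                           # 3 candidate rewrites per node
        child = mock_llm_rewrite(code)
        s = sandbox_score(child)
        heapq.heappush(frontier, (-s, child))
        if s > best_score:
            best_score, best = s, child
print(best_score)                                # approaches 0 as predict(3) -> 9
```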
With compact research ideas injected into the prompt, the system reaches expert level and beats strong baselines across biology, epidemiology, geospatial, neuroscience, time series, and numerical methods.
Training speed: less than 2 hours on 1 T4 vs 36 hours on 16 A100s.
In bioinformatics, it came up with 40 new approaches for single-cell data analysis that beat the best human-designed methods on a public benchmark.
In epidemiology, it built 14 models that set state-of-the-art results for predicting COVID-19 hospitalizations.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Empirical software is code built to maximize a quality score on observed data, and any task that fits this framing becomes a scorable task.
This view turns software creation into a measurable search problem, because every candidate program is judged by the same numeric target.
This framing also explains why the method can travel across domains, since only the scoring function changes.
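A tiny sketch of that portability point, with made-up domain names and metrics: the search machinery stays fixed and only the scorer is swapped.

```python
# Same search loop, different scorer per domain. Domain names and metrics here
# are illustrative; in the paper each benchmark supplies its own quality score.
from typing import Callable, Dict, List

SCORERS: Dict[str, Callable[[List[float], List[float]], float]] = {
    "forecasting":  lambda pred, truth: -sum((p - t) ** 2 for p, t in zip(pred, truth)),        # negative MSE
    "segmentation": lambda pred, truth: sum(p == t for p, t in zip(pred, truth)) / len(truth),  # accuracy
}

def evaluate(domain: str, pred, truth) -> float:
    return SCORERS[domain](pred, truth)          # higher is always better for the search

print(evaluate("forecasting", [1.0, 2.0], [1.5, 2.0]))   # -0.25
print(evaluate("segmentation", [1, 0, 1], [1, 1, 1]))    # ~0.67
```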
🧵3/n. This figure breaks down how the system works.
The top-left part shows the workflow. A scorable problem and some research ideas are given to an LLM, which then generates code. That code is run in a sandbox to get a quality score. Tree search is used to decide which code branches to keep improving, balancing exploration of new ideas with exploitation of ones that already look promising.
On the right, different ways of feeding research ideas into the system are shown. Ideas can come from experts writing direct instructions, from scientific papers that are summarized, from recombining prior methods, or from LLM-powered deep research. These sources make the search more informed and help the model produce stronger, more competitive solutions.
So overall, the loop of tree search plus targeted research ideas turns an LLM from a one-shot code generator into a system that steadily climbs toward expert-level performance.
"There's no language out there in nature. You don't go out in nature and there's words written in the sky for you.. There is a 3D world that follows laws of physics."
Language is purely generated signal.
AI models trained on linguistic signals fail when the task requires embodied physical common sense in a world with real constraints.
To give some context to her explanation, this benchmark evaluates 75 vision-language models and shows they still struggle with physical-world understanding.
The paper attributes the failures to missing physical priors and limited exposure to physically grounded data. Even with images and text, the models lack robust knowledge of object properties and dynamics, reinforcing that linguistic data is not the same as contact with a law‑governed world.
LLMs get stuck when they think too long along a single line: early tokens steer them into a narrow path they rarely recover from, which the authors call Tunnel Vision.
ParaThinker trains native parallel thinking: it spins up multiple distinct reasoning paths at once and then fuses them into 1 answer, which lifts accuracy a lot at a tiny latency cost.
Sensational fact, if you only keep 1 thing: 12.3% average gain for 1.5B, 7.5% for 7B, with only 7.1% extra latency.
ParaThinker shows that training LLMs to think in parallel paths instead of just longer single chains avoids tunnel vision, giving up to 12.3% accuracy gains with only 7.1% extra latency, letting smaller models beat much larger ones.
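A schematic of the parallel-then-fuse idea, with a mock sampler in place of an LLM decoding K distinct reasoning paths. Fusion is shown as a simple majority vote for readability; ParaThinker instead learns the fusion step end-to-end inside the model.

```python
# Schematic of parallel-path reasoning followed by a fusion step. The mock
# sampler stands in for an LLM decoding K distinct reasoning paths; the fusion
# shown is majority voting, whereas ParaThinker learns fusion inside the model.
import random
from collections import Counter

def sample_reasoning_path(question, seed):
    """Stand-in for one decoded reasoning path; each path may end differently."""
    rng = random.Random(seed)
    answer = 42 if rng.random() < 0.7 else rng.randint(0, 9)   # right ~70% of the time
    return {"path_id": seed, "steps": f"...reasoning about {question}...", "answer": answer}

def parallel_think(question, k=8):
    paths = [sample_reasoning_path(question, seed) for seed in range(k)]  # decoded in parallel
    votes = Counter(p["answer"] for p in paths)
    fused_answer, _ = votes.most_common(1)[0]    # fuse the paths into 1 answer
    return fused_answer, paths

answer, paths = parallel_think("What is 6 * 7?", k=8)
print(answer)   # a single early mistake no longer dooms the final answer
```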
🧵 Read on 👇
🧵2/n. 🧩 Why longer thinking stalls
When the model makes a mistake early on, it keeps building on that mistake.
The longer it goes down that wrong path, the less chance it has to recover.
This stuck behavior is what the authors call Tunnel Vision, and it explains why just letting the model think longer doesn’t always improve accuracy.
🧵3/n. 🚀 Why parallel width helps
The real slowdown in decoding comes from moving data in and out of memory, not from doing the math.
When the model runs several reasoning paths in parallel, it reuses the same memory loads for more work.
Even running 16 paths at once takes less than 2x the time of a single path, so parallel thinking is both faster and more accurate.
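A back-of-the-envelope model of why that holds: each decode step streams the full weight matrices from memory once, and that cost is shared by every path in the batch; only the per-path KV-cache reads grow. All of the hardware numbers below are hypothetical.

```python
# Back-of-the-envelope model of why 16 parallel paths cost far less than 16x.
# Each decode step streams the weights from memory once, shared by the whole
# batch; only KV-cache reads grow per path. All numbers are illustrative.
weights_bytes = 7e9 * 2                        # 7B params in fp16
kv_bytes_per_path = 32 * 2 * 4096 * 2 * 1024   # 32 layers, K+V, hidden 4096, fp16, 1024-token context
bandwidth = 1.0e12                             # hypothetical 1 TB/s of memory bandwidth

def decode_step_time(num_paths):
    bytes_moved = weights_bytes + num_paths * kv_bytes_per_path
    return bytes_moved / bandwidth             # decoding is memory-bound, so time ~ bytes moved

ratio = decode_step_time(16) / decode_step_time(1)
print(f"16 paths cost ~{ratio:.2f}x one path per decode step")   # well under 2x, not 16x
```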
Dynamic Speculative Planning shows how to speed up LLM agents while cutting cost and keeping answers unchanged.
30% lower total cost and 60% less wasted cost at comparable acceleration.
Agents plan step by step, so each call waits for the previous one, which drags latency.
Speculative planning fixes that by having a cheap draft agent guess next steps while a stronger agent checks them in parallel.
Fixed guess lengths backfire: small guesses barely help, and big guesses waste tokens when a check disagrees.
Dynamic Speculative Planning learns how far to guess, then stops early to avoid wasted calls.
A tiny online predictor learns how many steps will be right using reinforcement learning.
1 knob lets teams bias for speed or cost, either by skewing training or adding a small offset.
If a guess is wrong, extra threads stop and execution resumes from the verified step.
Across OpenAGI and TravelPlanner, the dynamic policy matches the fastest fixed policy while spending fewer tokens.
The result is clear: faster responses, lower bills, and 0 loss in task quality.
How Dynamic Speculative Planning manages when and how far to guess ahead during an agent's planning.
The top line called Predictor decides how many future steps to guess, marked by k. For example, k=2 means guess 2 steps ahead, while k=3 means guess 3 steps ahead. These guesses are carried out by a lighter agent called Approximation, and then checked in parallel by a stronger agent called Target.
If the guesses match the stronger agent, they are confirmed and execution continues. If they don’t match, shown with an X, all ongoing speculative threads are canceled, and the system resumes from the last correct step. This prevents wasted work from wrong guesses.
At the same time, an online Trainer collects data about each state and the chosen k. This data is then used to update the Predictor so it learns better over time without slowing down the agent. In other words, the system keeps improving its ability to guess how far it can safely look ahead.
So overall, the figure captures this cycle: make a guess, verify, cancel if wrong, and then use that experience to improve the predictor for the next run.
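A runnable schematic of that guess / verify / cancel / learn cycle. The draft and target agents are mocked, verification is sequential here (it runs in parallel in the real system), and the running-average predictor stands in for the paper's RL-trained one.

```python
# Schematic of the guess / verify / cancel cycle. The draft and target agents
# are mocked, and the predictor is a simple running average standing in for the
# RL-trained predictor described above.
import random

def draft_step(state):                  # cheap approximation agent (mocked)
    return state + random.choice([1, 2])

def target_step(state):                 # stronger target agent, treated as ground truth
    return state + 1

class Predictor:
    """Learns how many steps ahead it is usually safe to guess (running average here)."""
    def __init__(self, k=3):
        self.k = float(k)
    def propose_k(self):
        return max(1, round(self.k))
    def update(self, accepted):
        self.k = 0.9 * self.k + 0.1 * (accepted + 1)   # drift toward the observed safe depth

def speculative_plan(start, total_steps):
    predictor, state, done = Predictor(), start, 0
    while done < total_steps:
        k = min(predictor.propose_k(), total_steps - done)
        guesses, s = [], state
        for _ in range(k):                        # draft agent guesses k steps ahead
            s = draft_step(s)
            guesses.append(s)
        accepted, s = 0, state
        for g in guesses:                         # target agent verifies each guessed step
            s = target_step(s)
            if g != s:                            # mismatch: cancel remaining speculation
                break
            accepted += 1
        # resume from the last verified step; the target's own step is used if nothing matched
        state = target_step(state) if accepted == 0 else guesses[accepted - 1]
        done += max(1, accepted)
        predictor.update(accepted)                # online trainer improves the guess depth
    return state

print(speculative_plan(0, total_steps=20))        # completes 20 verified steps
```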
Why using a fixed number of speculative steps can be either too cautious or too aggressive.
On the left side, the system guesses only 2 steps ahead. Because it does not speculate far, it avoids wasted work, but the total task takes longer since the process is not sped up much.
On the right side, the system guesses 6 steps ahead. This makes things faster at first, but when the stronger agent disagrees at step 4, everything predicted after that point becomes useless. Steps 5 and 6 are wasted, which means extra cost without benefit.
So the main point is that small guesses save resources but barely speed things up, while large guesses speed things up but waste a lot of work when they go wrong. This shows why a fixed guessing strategy is not efficient and why an adaptive method is needed.