To train a model with reinforcement learning in a fully decentralized setting using community-contributed GPUs, we open-source several novel infrastructure components.
PRIME-RL: A fully asynchronous reinforcement learning framework designed for decentralized training. Decoupling of rollout generation, model training, and weight broadcasting enables training across heterogeneous, unreliable networks.
SHARDCAST: A library for distributing large files via an HTTP-based tree-topology network that efficiently propagates updated model weights from training nodes to the decentralized inference workers.
TOPLOC Validators: A validator service using TOPLOC proofs to ensure that rollouts from untrusted inference workers can be trusted for model training.
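To make the tree-topology idea concrete, here is a minimal conceptual sketch, not the actual SHARDCAST code or API: each node pulls a weight shard from its parent over plain HTTP and can then serve the same file onward to its children, so bandwidth fans out down the tree instead of every worker hitting the training nodes directly.

```python
import requests  # assumption: nodes expose the shard over plain HTTP

# Hypothetical illustration of tree-based relaying, not the SHARDCAST API.
def relay_shard(parent_url: str, shard_name: str, local_dir: str) -> str:
    """Download a weight shard from this node's parent over HTTP."""
    local_path = f"{local_dir}/{shard_name}"
    with requests.get(f"{parent_url}/{shard_name}", stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # stream in 1 MiB chunks
                f.write(chunk)
    # Once stored locally, this node can serve the same file to its children,
    # pushing the weights one level further down the tree.
    return local_path
```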
INTELLECT-2 is trained using rule-based rewards on math and coding problems, plus length rewards that guide the model to follow its thinking budget. We introduce modifications to the standard GRPO recipe to enhance training stability and encourage faster learning.
Two-step asynchronous RL: The broadcast of new policy weights is fully overlapped with ongoing inference and training, eliminating communication bottlenecks.
Two-Sided GRPO Clipping: Stabilizes training by mitigating gradient spikes with two-sided token probability ratio clipping (see the sketch after this list).
Advanced Data Filtering: Combines offline and online filtering to select challenging tasks, significantly enhancing model learning efficiency.
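As a rough illustration of the clipping change, here is a minimal PyTorch sketch. The interpretation, the `eps` value, and the upper cap `delta` are illustrative assumptions rather than the exact recipe: the point is that for negative-advantage tokens the unclipped term `ratio * A` can still blow up when the ratio is huge, so the ratio itself is additionally capped.

```python
import torch

def grpo_loss_two_sided(logp_new, logp_old, advantages,
                        eps: float = 0.2, delta: float = 10.0):
    """Token-level GRPO surrogate with an extra upper cap on the ratio.

    Standard clipping bounds the objective via clip(ratio, 1-eps, 1+eps),
    but for negative advantages the unclipped term ratio * A can still
    spike when ratio is very large. Capping the ratio at `delta`
    (illustrative value) bounds that term too.
    """
    ratio = torch.exp(logp_new - logp_old)
    capped_ratio = torch.clamp(ratio, max=delta)           # two-sided: extra upper cap
    unclipped = capped_ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```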
Experiments:
We report results from two main experiments: TARGET-SHORT, an experimental run with short target lengths to train an efficient reasoning model, and TARGET-LONG, our main run with longer target lengths.
Reward Trajectories:
Benchmark Performance:
We were able to increase the performance of QwQ-32B on math and coding benchmarks. Since QwQ-32B is already very strong and heavily trained with RL, substantially larger gains will likely require stronger base models or higher-quality data.
INTELLECT-2 demonstrates that globally decentralized RL works.
Now, we’re focusing on tool-assisted reasoning, crowdsourcing higher-quality data, and optimizing our infrastructure and training recipe to build frontier open models.
Join us to build open-source, decentralized AGI.
First introduced by @a1zhang in Oct 2025, the RLM (Recursive Language Model) has access to its inputs through a variable in a persistent Python REPL.
The model can inspect & transform that variable with code, and pipe parts of it into sub-LLMs with tools without ever loading the potentially huge input data into its context.
RLMs are a simple, flexible form of context folding that doesn't depend on lossy summarization.
Instead, the model proactively delegates context to:
- Python scripts (search, filter, transform)
- Sub-LLMs (fresh instances) for parallel work
- Iterative answer edits until it's actually correct
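A hypothetical sketch of what one RLM-style REPL turn could look like; `ctx`, `sub_llm`, and the filtering logic are illustrative stand-ins for the persistent input variable and the sub-LLM tool, not a specific API:

```python
# Hypothetical RLM-style REPL turn: only small slices of the huge input
# ever reach a model's context window.
import re

ctx = open("huge_report.txt").read()   # in an RLM this is pre-loaded REPL state

def sub_llm(prompt: str) -> str:
    """Stand-in for spawning a fresh sub-LLM instance on a small prompt."""
    raise NotImplementedError("call your LLM API of choice here")

# 1. Inspect & transform the input with code instead of reading it into context
sections = re.split(r"\n## ", ctx)
relevant = [s for s in sections if "revenue" in s.lower()]

# 2. Pipe only small slices into sub-LLMs for parallel extraction
notes = [sub_llm(f"List any figures mentioned here:\n{s[:4000]}") for s in relevant]

# 3. Iterate on the final answer until it's actually correct
answer = sub_llm("Merge these notes into one answer:\n" + "\n".join(notes))
print(answer)
```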
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack
Achieving state-of-the-art performance for its size across math, code and reasoning
Built using the same tools we put in your hands, from environments & evals to RL frameworks, sandboxes & more
INTELLECT-3 is a 106B parameter Mixture-of-Experts model trained with both SFT and RL on top of the GLM 4.5 Air Base model.
Both stages, including multiple ablations, were carried out on a 512-GPU H200 cluster over the course of two months.
Our Training Stack
+ PRIME-RL: Our scalable, asynchronous RL trainer
+ Verifiers: Our unified library used for hundreds of envs and evals on the Environments Hub
+ Sandboxes: Custom container infra optimized for agentic RL
+ Compute: Orchestration & observability for 512 H200s
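For a sense of how an environment from the Hub plugs into this stack, here is a rough sketch of the flow; treat the exact Verifiers function names and signatures (`load_environment`, `evaluate`) and the environment name as assumptions and check the library docs:

```python
# Rough sketch only: exact Verifiers names/signatures may differ.
import verifiers as vf
from openai import OpenAI

# Pull a community environment from the Environments Hub (name is illustrative).
env = vf.load_environment("math-python")

# Run a quick eval against any OpenAI-compatible endpoint, e.g. a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
results = env.evaluate(client=client, model="INTELLECT-3", num_examples=32)
print(results)
```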
We're scaling our Open-Source Environments Program
As part of this, we're committing hundreds of thousands of dollars in bounties and looking for partners who want to join our mission to accelerate open superintelligence
Join us in building the global hub for environments and evals
Over the past 2 months, we've crowdsourced 400+ environments and 80+ verified implementations through our bounties and RL residency across:
+ Autonomous AI Research
+ Browser Automation
+ Theorem Proving
+ Subject-Specific QA
+ Legal/Finance Tasks
+ Many more...
Thank you to everyone who's claimed a bounty or joined the residency!
These span autonomous AI research, MCP integrations, and browser automation, as well as domain-specific environments for economically valuable tasks across law, finance, and tax.
NanoGPT Speedrun
Evaluates the code-generation and pretraining capabilities of LLMs via the NanoGPT Speedrun benchmark.
- Request 8–1,000+ GPU clusters
- Get quotes from 50+ providers within 24 hours
- Re-sell idle GPUs back to our spot market
- Support from our research team
Expanding our Compute Exchange
- Find the best and most cost-effective reserved instance offers across 50+ providers
- Re-sell idle GPUs from your reserved cluster on our liquid compute market
- H100s, H200s, B200s, and NVL72 clusters available today
Additional Features
- Orchestration with SLURM, Ray or Kubernetes
- Monitoring with Grafana dashboards
- Native integrations into our full-stack infra offering: Environment Hub, Sandboxes, Reinforcement Fine-Tuning, Multi-Node Training
- Dedicated support from our research team