Scaling training environments for RL by simulating them with reasoning LLMs!
Environment models + replay buffer + new tasks = cheap RL for any environment!
- Strong improvements on non-RL-ready environments and across multiple model families!
- Works even better in sim-to-real RL settings → a warm start for high-cost environments
🧵 1/7
Why is RL for language agents still so hard?
- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing
Scaling RL for LLM agents is bottlenecked not by models, but by access to diverse environments and experience data
🧵 2/7
🛠️ Recipe for DreamGym
1. Reasoning-Based Experience Model:
- Takes in (state, action, task) and predicts next state + reward
- Uses CoT to explain why the transition happens → more stable RL signals
2. Grounded Replay Buffer:
- Each prediction is conditioned on similar reference trajectories
- Reduces hallucinations & keeps transitions causally consistent
3. Entropy-Based Task Curriculum:
- New tasks are automatically paired with reward signals in DreamGym!
- Finds tasks that are intermediate difficulty for the current policy
- Generates variations of these tasks to challenge the agent → automatic curriculum learning
A scalable "synthetic environment" that produces consistent, diverse, reward-dense experience for RL training. A rough sketch of these pieces is below.
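To make this concrete, here's a minimal Python sketch of one synthetic rollout step plus entropy-style task selection. Everything here is illustrative: the `llm` callable, the word-overlap retrieval, the JSON prompt format, and the 0/1 reward are assumptions for the sketch, not DreamGym's actual implementation.

```python
import json
from typing import Callable

def retrieve_similar(buffer: list[dict], state: str, action: str, k: int = 3) -> list[dict]:
    """Toy stand-in for replay-buffer grounding: rank stored transitions by word overlap."""
    query = set(f"{state} {action}".lower().split())
    return sorted(buffer, key=lambda t: -len(query & set(t["text"].lower().split())))[:k]

def synthetic_step(llm: Callable[[str], str], buffer: list[dict],
                   task: str, state: str, action: str) -> tuple[str, float, str]:
    """One experience-model step: (task, state, action) -> (next_state, reward, rationale)."""
    refs = "\n".join(t["text"] for t in retrieve_similar(buffer, state, action))
    prompt = (
        f"Task: {task}\nState: {state}\nAction: {action}\n"
        f"Reference transitions:\n{refs}\n"
        "Reason step by step about what this action causes, then answer with JSON "
        '{"rationale": "...", "next_state": "...", "reward": 0 or 1}.'
    )
    out = json.loads(llm(prompt))  # the reasoning LLM plays the environment
    return out["next_state"], float(out["reward"]), out["rationale"]

def pick_training_tasks(success_rate: dict[str, float], n: int = 8) -> list[str]:
    """Entropy-style curriculum: prefer tasks whose success rate is closest to 0.5
    (maximally uncertain for the current policy); variations of these are generated next."""
    return sorted(success_rate, key=lambda t: abs(success_rate[t] - 0.5))[:n]
```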
🧵 3/7
Main Results
DreamGym beats or matches state-of-the-art RL, without any real-environment rollouts.
- On non-RL-ready environments such as WebArena, DreamGym is the only scalable method that makes RL work, with a +30% gain over all baselines
- On RL-costly environments such as WebShop & ALFWorld, DreamGym matches PPO/GRPO trained with 80K real interactions while using 0 real interactions
- With sim-to-real transfer, DreamGym-S2R gives +40% performance while using 10× less real data
- Works across different RL algorithms and model families → both algorithm- and backbone-agnostic
🧵 4/7
Why synthetic experience matters: efficiency, transfer, and scaling
- 4× faster training than real-env RL → less infra rollout time and fewer GPU hours!
- Synthetic training on WebShop tasks → the policy transfers to WebArena & ALFWorld!
- Smoother, faster learning curves; DreamGym-S2R converges highest and fastest!
🧵 5/7
What makes DreamGym work?
- Removing replay grounding → big drop in state consistency, more hallucination
- Removing reasoning traces → states become shallow, reward errors rise
- Removing task generation → the policy plateaus early
Agent Learning via Early Experience
Paper: arxiv.org/abs/2510.08558
- SFT data for agents is sparse; RL over long horizons is hard
- We provide new mid-training signals that work: 1) an implicit next-state world-modeling task, 2) self-reflection on alternate states
- Strong improvements across 8 environments and multiple model families
- Works well for subsequent RL!
🧵 1/5
Recipe 👨‍🍳:
1) Implicit world modeling
Augments expert trajectories with alternative actions and predicted next states, training the policy to internalize transition dynamics before deployment.
2) Self-reflection
Augments expert actions with self-generated explanations, training the policy to reason about and revise its own decisions.
Both methods use K alternative actions proposed by the initial policy (LLM); a rough sketch of the data construction is below.
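A rough sketch of how the two augmentation datasets could be built from expert trajectories. The helper callables (`policy`, `next_state_fn`, `reflect`) and the prompt formats are assumptions for illustration, not the paper's exact setup.

```python
from typing import Callable

def build_early_experience(
    policy: Callable[[str, int], list[str]],    # proposes K alternative actions for a state
    next_state_fn: Callable[[str, str], str],   # returns the state that follows an action
    reflect: Callable[[str, str, str], str],    # explains why the expert action beats an alternative
    expert_traj: list[dict],                    # [{"state": ..., "action": ...}, ...]
    k: int = 4,
):
    """Turn expert trajectories into the two mid-training datasets described above."""
    world_model_data, reflection_data = [], []
    for step in expert_traj:
        s, a_expert = step["state"], step["action"]
        for a_alt in policy(s, k):              # K alternative actions from the initial policy
            s_next = next_state_fn(s, a_alt)
            # (1) implicit world modeling: internalize transition dynamics
            world_model_data.append({
                "prompt": f"State: {s}\nAction: {a_alt}\nPredict the next state:",
                "target": s_next,
            })
            # (2) self-reflection: reason about and revise the decision
            reflection_data.append({
                "prompt": f"State: {s}\nCandidate action: {a_alt}\nWhich action is best and why?",
                "target": reflect(s, a_alt, a_expert) + f"\nChosen action: {a_expert}",
            })
    return world_model_data, reflection_data
```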
🧵 2/5
Results: across 8 benchmarks we see performance improvements over prompting the instruction-tuned model and over imitation learning on the task.
🧵 3/5
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♀️💪
Paper: arxiv.org/abs/2510.07242
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward models → better results!
✔️ Stratified normalization anchors dense scores within verifier groups
✔️ Variance-aware weighting emphasizes harder, high-variance prompts
✔️ Stable + informative rewards, no drift (rough sketch below)
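Here's a rough numerical sketch of what a hybrid reward of this shape could look like for one prompt's rollouts. The band width, min-max normalization, and variance-based weight are illustrative choices, not HERO's actual equations.

```python
import numpy as np

def hybrid_rewards(verifier: np.ndarray, rm_scores: np.ndarray,
                   band: float = 0.5) -> np.ndarray:
    """Blend binary verifier labels with dense RM scores for one prompt's rollouts."""
    out = np.empty_like(rm_scores, dtype=float)
    for v in (0, 1):                              # stratify by verifier group
        mask = verifier == v
        if not mask.any():
            continue
        s = rm_scores[mask]
        rng = s.max() - s.min()
        norm = (s - s.min()) / rng if rng > 0 else np.full_like(s, 0.5)
        out[mask] = v + band * (norm - 0.5)       # dense scores stay anchored to their group
    return out

def prompt_weight(verifier: np.ndarray) -> float:
    """Variance-aware weighting: prompts whose rollouts disagree get more weight."""
    return float(verifier.var())                  # peaks at 0.25 when half the rollouts pass

# Example: 4 rollouts of one prompt
v = np.array([1, 0, 1, 0])
rm = np.array([0.9, 0.7, 0.4, 0.1])
print(hybrid_rewards(v, rm), prompt_weight(v))    # verified rollouts always outrank unverified ones
```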
Results:
- +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
- Generalizes across Qwen and OctoThinker models
- Works well when training on easy-to-verify, hard-to-verify, and mixed samples.
(a) Rule-based Reward: precise but brittle → gives 0 to many almost-correct answers.
→ correctness, but too strict.
(b) Reward Model: smooth but misaligned → sometimes rewards wrong answers (false positives) or underrates correct ones (false negatives).
→ coverage, but too loose.
Neither sparse nor dense alone is enough (see table).
Bridging Offline & Online RL for LLMs
Paper: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO, syncing every s steps (more efficient!), also works very well.
- Offline DPO is way behind.
- Combining verifiable + non-verifiable works! Cross-transfer gains.
- Recipes for how to make this work (semi-online loop sketched below).
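A minimal sketch of the semi-online loop. The helpers `generate_pairs`, `dpo_update`, and `policy.copy()` are hypothetical stand-ins, not names from the paper.

```python
def semi_online_dpo(policy, ref_model, prompt_batches, s, generate_pairs, dpo_update):
    """Semi-online (iterative) DPO sketch: rollouts come from a snapshot of the policy
    that is re-synced only every s optimizer steps."""
    generator = None
    for step, batch in enumerate(prompt_batches):
        if step % s == 0:                          # sync point: refresh the rollout model
            generator = policy.copy()              # assumed: returns a frozen weight snapshot
        pairs = generate_pairs(generator, batch)   # (chosen, rejected) pairs from possibly stale samples
        dpo_update(policy, ref_model, pairs)       # standard DPO step on the live policy
    return policy
```

With s = 1 every batch is generated on-policy (online DPO); letting s grow toward the whole dataset recovers the offline setting.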
🧵 1/4
- Online DPO gives a 59.4% increase in AlpacaEval LC winrate & a 56.2% increase in ArenaHard score over standard DPO, which lags due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But, more surprisingly, so does semi-online DPO.
🧵 2/4
We find similar results on verifiable tasks as well.
Semi-online DPO even performs a little bit better on some tasks.
🧵 3/4
🚨 New paper! 🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs
How does an LLM judge judgments? We use LLM-as-a-Meta-Judge, see prompt in figure.
- Make N judgments for a given pair of responses & compute pairwise meta-judgments
- Compute Elo scores for the judgments from this pairwise matrix
- Create LLM-as-a-judge preference pairs via the Elo scores (sketch below)
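A small sketch of the Elo step, assuming the meta-judge's pairwise outcomes are already collected into an N×N win-count matrix; this simple iterative Elo update is illustrative and not necessarily the paper's exact scheme.

```python
import numpy as np

def elo_from_wins(wins: np.ndarray, k: float = 16.0, rounds: int = 100) -> np.ndarray:
    """Elo ratings from an N x N matrix where wins[i, j] = # times judgment i beat j
    according to the meta-judge."""
    n = wins.shape[0]
    ratings = np.zeros(n)
    for _ in range(rounds):
        for i in range(n):
            for j in range(n):
                for _win in range(int(wins[i, j])):   # replay each recorded win of i over j
                    expected = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
                    ratings[i] += k * (1.0 - expected)
                    ratings[j] -= k * (1.0 - expected)
    return ratings

# N judgments of the same response pair -> pick chosen/rejected judgments for training the judge
wins = np.array([[0, 2, 3],
                 [1, 0, 2],
                 [0, 1, 0]])
scores = elo_from_wins(wins)
chosen, rejected = int(scores.argmax()), int(scores.argmin())
```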
🧵 (3/6)
The resulting data is remarkably high quality and impactful for training, even though it is produced through self-alignment: it outperforms other instruction-tuning datasets at the same data size.