Scaling Agent Learning via Experience Synthesis
Paper: arxiv.org/abs/2511.03773
Scaling training environments for RL by simulating them with reasoning LLMs!
Environment models + Replay buffer + New tasks = cheap RL for any environment!
- Strong improvements on non-RL-ready environments, across multiple model families!
- Works better in sim-to-real RL settings → warm-start for high-cost environments
🧵1/7
🤔 Why is RL for language agents still so hard?
- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing
Scaling RL for LLM agents is bottlenecked not by models, but by the lack of diverse environments and experience data
🧵2/7
🛠️ Recipe for DreamGym
1. Reasoning-Based Experience Model:
- Takes in (state, action, task) and predicts next state + reward
- Uses CoT to explain why the transition happens → more stable RL signals
2. Grounded Replay-Buffer:
- Each prediction is conditioned on similar reference trajectories
- Reduces hallucinations & keeps transitions causally consistent
3. Entropy-Based Task Curriculum:
- New tasks are automatically paired with reward signals in DreamGym!
- Finds tasks of intermediate difficulty for the current policy
- Generates variations of these tasks to challenge the agent → automatic curriculum learning
A scalable "synthetic environment" that produces consistent, diverse, reward-dense experience for RL training (toy sketch below)
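To make the recipe concrete, here is a minimal toy sketch of one synthetic rollout step, under assumptions from this thread: a text-in/text-out reasoning LLM, a replay buffer of past transitions, and a simple token-overlap retriever standing in for the grounding mechanism. All names (`Transition`, `retrieve_similar`, `synthesize_step`) are hypothetical, not DreamGym's actual API.

```python
# Toy sketch of a DreamGym-style synthetic experience step (illustrative only).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transition:
    task: str
    state: str
    action: str
    next_state: str
    reward: float


def _similarity(a: str, b: str) -> float:
    """Cheap token-overlap similarity, standing in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def retrieve_similar(buffer: List[Transition], task: str, state: str, k: int = 3) -> List[Transition]:
    """Ground the prediction: fetch the k most similar past transitions."""
    scored = sorted(
        buffer,
        key=lambda t: _similarity(t.task + " " + t.state, task + " " + state),
        reverse=True,
    )
    return scored[:k]


def synthesize_step(
    llm: Callable[[str], str],  # any text-in/text-out reasoning LLM
    buffer: List[Transition],
    task: str,
    state: str,
    action: str,
) -> Tuple[str, float, str]:
    """Predict (next_state, reward) with a CoT rationale, conditioned on references."""
    refs = retrieve_similar(buffer, task, state)
    ref_block = "\n".join(
        f"- state: {r.state} | action: {r.action} -> next: {r.next_state} (reward {r.reward})"
        for r in refs
    )
    prompt = (
        f"Task: {task}\nCurrent state: {state}\nAgent action: {action}\n"
        f"Reference transitions:\n{ref_block}\n"
        "First reason step by step about what this action causes, then output:\n"
        "NEXT_STATE: <...>\nREWARD: <float>"
    )
    out = llm(prompt)
    rationale, _, tail = out.partition("NEXT_STATE:")
    next_state, _, reward_str = tail.partition("REWARD:")
    try:
        reward = float(reward_str.strip().split()[0])
    except (ValueError, IndexError):
        reward = 0.0  # fall back if the model's output is malformed
    return next_state.strip(), reward, rationale.strip()
```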
🧵3/7
Main Results
DreamGym beats or matches state-of-the-art RL, without any real-environment rollouts.
- On non-RL-ready environments such as WebArena, DreamGym is the only scalable method that makes RL work, +30% over all baselines
- On RL-costly environments such as WebShop & ALFWorld, DreamGym matches PPO / GRPO trained with 80K real interactions, while using 0 real interactions
- With Sim-to-Real transfer, DreamGym-S2R gives +40% performance while using 10× less real data
- Works across different RL algorithms and model families → both algorithm- and backbone-agnostic
🧵4/7
🔥 Why synthetic experience matters: efficiency, transfer, and scaling
⏱️ 4× faster training than real-env RL → cuts infra overhead, rollout time, and GPU hours!
Synthetic training on WebShop tasks → the policy transfers to WebArena & ALFWorld!
Smoother, faster learning curves; DreamGym-S2R converges to the highest performance, and faster!
🧵5/7
What makes DreamGym work?
- Removing replay grounding → big drop in state consistency, higher hallucination
- Removing reasoning traces → states become shallow, reward errors rise
- Removing task generation → policy plateaus early
In short:
✅ Reasoning = stable transitions
✅ Replay buffer = factual grounding
✅ Entropy-based curriculum = continuous improvement
LLM agents don't just need more data, they need useful experience.
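For intuition on the curriculum step, here is a hypothetical sketch: score each task by the Bernoulli entropy of the agent's recent success rate (which peaks at 50% success, i.e., intermediate difficulty) and spawn variations of the top-scoring seeds. The exact selection rule and the `mutate_task` hook are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical entropy-based task curriculum: prefer tasks the current policy finds
# neither trivial nor impossible, then generate variations of those seeds.
import math
import random
from typing import Callable, Dict, List


def outcome_entropy(success_rate: float) -> float:
    """Bernoulli entropy of the agent's success rate; maximal at 0.5."""
    p = min(max(success_rate, 1e-6), 1 - 1e-6)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def select_seed_tasks(success_rates: Dict[str, float], top_k: int = 4) -> List[str]:
    """Pick the tasks with the highest outcome entropy under the current policy."""
    ranked = sorted(success_rates, key=lambda t: outcome_entropy(success_rates[t]), reverse=True)
    return ranked[:top_k]


def expand_curriculum(
    success_rates: Dict[str, float],
    mutate_task: Callable[[str], str],  # e.g., an LLM call that rewrites a task into a variant
    variations_per_seed: int = 2,
) -> List[str]:
    """Return new training tasks: high-entropy seeds plus generated variations of each."""
    new_tasks: List[str] = []
    for seed in select_seed_tasks(success_rates):
        new_tasks.append(seed)
        new_tasks.extend(mutate_task(seed) for _ in range(variations_per_seed))
    return new_tasks


# Toy usage with a dummy mutator:
if __name__ == "__main__":
    rates = {"buy cheapest red mug": 0.95, "compare 3 laptops": 0.55, "file a return": 0.05}
    print(expand_curriculum(rates, lambda t: t + f" (variant {random.randint(1, 99)})"))
```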
🧵6/7
🧩 Takeaway
The bottleneck in agent RL wasn't the policy, it was the experience.
DreamGym shows that when we replace heterogeneous real-world rollouts with reasoning-grounded synthetic experience, RL becomes:
✅ scalable & unified
✅ far less dependent on human engineering
✅ generalizable and transferable
If we can synthesize the right experiences, we can train agents before they ever touch the real world.
Paper: arxiv.org/abs/2511.03773
Feedback & collaborations welcome
🧵7/7
