Scaling Agent Learning via Experience Synthesis
Paper: arxiv.org/abs/2511.03773
Scaling training environments for RL by simulating them with reasoning LLMs!
Environment models + Replay buffer + New tasks = cheap RL for any environment!
- Strong improvements on non-RL-ready environments, across multiple model families!
- Works better in sim-to-real RL settings → warm-start for high-cost environments
🧵1/7
🤔 Why is RL for language agents still so hard?
- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing
Scaling RL for LLM agents is bottlenecked not by models, but by the lack of diverse environments and experience data
🧵2/7
🛠️ Recipe for DreamGym
1. Reasoning-Based Experience Model:
- Takes in (state, action, task) and predicts next state + reward
- Uses CoT to explain why the transition happens → more stable RL signals
2. Grounded Replay-Buffer:
- Each prediction is conditioned on similar reference trajectories
- Reduces hallucinations & keeps transitions causally consistent
3. Entropy-Based Task Curriculum:
- New tasks are automatically paired with reward signals in DreamGym!
- Finds tasks of intermediate difficulty for the current policy
- Generates variations of these tasks to challenge the agent → automatic curriculum learning
A scalable "synthetic environment" that produces consistent, diverse, reward-dense experience for RL training (toy sketch below)
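To make the recipe concrete, here is a minimal toy sketch of one synthetic rollout step, under assumptions from this thread: a text-in/text-out reasoning LLM, a replay buffer of past transitions, and a simple token-overlap retriever standing in for the grounding mechanism. All names (`Transition`, `retrieve_similar`, `synthesize_step`) are hypothetical, not DreamGym's actual API.

```python
# Toy sketch of a DreamGym-style synthetic experience step (illustrative only).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transition:
    task: str
    state: str
    action: str
    next_state: str
    reward: float


def _similarity(a: str, b: str) -> float:
    """Cheap token-overlap similarity, standing in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def retrieve_similar(buffer: List[Transition], task: str, state: str, k: int = 3) -> List[Transition]:
    """Ground the prediction: fetch the k most similar past transitions."""
    scored = sorted(
        buffer,
        key=lambda t: _similarity(t.task + " " + t.state, task + " " + state),
        reverse=True,
    )
    return scored[:k]


def synthesize_step(
    llm: Callable[[str], str],  # any text-in/text-out reasoning LLM
    buffer: List[Transition],
    task: str,
    state: str,
    action: str,
) -> Tuple[str, float, str]:
    """Predict (next_state, reward) with a CoT rationale, conditioned on references."""
    refs = retrieve_similar(buffer, task, state)
    ref_block = "\n".join(
        f"- state: {r.state} | action: {r.action} -> next: {r.next_state} (reward {r.reward})"
        for r in refs
    )
    prompt = (
        f"Task: {task}\nCurrent state: {state}\nAgent action: {action}\n"
        f"Reference transitions:\n{ref_block}\n"
        "First reason step by step about what this action causes, then output:\n"
        "NEXT_STATE: <...>\nREWARD: <float>"
    )
    out = llm(prompt)
    rationale, _, tail = out.partition("NEXT_STATE:")
    next_state, _, reward_str = tail.partition("REWARD:")
    try:
        reward = float(reward_str.strip().split()[0])
    except (ValueError, IndexError):
        reward = 0.0  # fall back if the model's output is malformed
    return next_state.strip(), reward, rationale.strip()
```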
🧵3/7
Main Results
DreamGym beats or matches state-of-the-art RL, without any real-environment rollouts.
- On non-RL-ready environments such as WebArena, DreamGym is the only scalable method that makes RL work, +30% over all baselines
- On RL-costly environments such as WebShop & ALFWorld, DreamGym matches PPO / GRPO trained with 80K real interactions, while using 0 real interactions
- With Sim-to-Real transfer, DreamGym-S2R gives +40% performance while using 10× less real data
- Works across different RL algorithms and model families → both algorithm- and backbone-agnostic
🧵4/7
🔥 Why synthetic experience matters: efficiency, transfer, and scaling
⏱️ 4× faster training than real-env RL → cuts infra overhead, rollout time, and GPU hours!
Synthetic training on WebShop tasks → the policy transfers to WebArena & ALFWorld!
Smoother, faster learning curves; DreamGym-S2R converges to the highest performance, and faster!
🧵5/7
What makes DreamGym work?
- Removing replay grounding → big drop in state consistency, higher hallucination
- Removing reasoning traces → states become shallow, reward errors rise
- Removing task generation → policy plateaus early
In short:
✅ Reasoning = stable transitions
✅ Replay buffer = factual grounding
✅ Entropy-based curriculum = continuous improvement
LLM agents don't just need more data, they need useful experience.
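For intuition on the curriculum step, here is a hypothetical sketch: score each task by the Bernoulli entropy of the agent's recent success rate (which peaks at 50% success, i.e., intermediate difficulty) and spawn variations of the top-scoring seeds. The exact selection rule and the `mutate_task` hook are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical entropy-based task curriculum: prefer tasks the current policy finds
# neither trivial nor impossible, then generate variations of those seeds.
import math
import random
from typing import Callable, Dict, List


def outcome_entropy(success_rate: float) -> float:
    """Bernoulli entropy of the agent's success rate; maximal at 0.5."""
    p = min(max(success_rate, 1e-6), 1 - 1e-6)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def select_seed_tasks(success_rates: Dict[str, float], top_k: int = 4) -> List[str]:
    """Pick the tasks with the highest outcome entropy under the current policy."""
    ranked = sorted(success_rates, key=lambda t: outcome_entropy(success_rates[t]), reverse=True)
    return ranked[:top_k]


def expand_curriculum(
    success_rates: Dict[str, float],
    mutate_task: Callable[[str], str],  # e.g., an LLM call that rewrites a task into a variant
    variations_per_seed: int = 2,
) -> List[str]:
    """Return new training tasks: high-entropy seeds plus generated variations of each."""
    new_tasks: List[str] = []
    for seed in select_seed_tasks(success_rates):
        new_tasks.append(seed)
        new_tasks.extend(mutate_task(seed) for _ in range(variations_per_seed))
    return new_tasks


# Toy usage with a dummy mutator:
if __name__ == "__main__":
    rates = {"buy cheapest red mug": 0.95, "compare 3 laptops": 0.55, "file a return": 0.05}
    print(expand_curriculum(rates, lambda t: t + f" (variant {random.randint(1, 99)})"))
```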
🧵6/7
🧩 Takeaway
The bottleneck in agent RL wasn't the policy, it was the experience.
DreamGym shows that when we replace heterogeneous real-world rollouts with reasoning-grounded synthetic experience, RL becomes:
✅ scalable & unified
✅ far less dependent on human engineering
✅ generalizable and transferable
If we can synthesize the right experiences, we can train agents before they ever touch the real world.
Paper: arxiv.org/abs/2511.03773
Feedback & collaborations welcome
🧵7/7
