Jason Weston
@Meta + NYU. NLP from scratch (Pretrain+FT LLM) 2008, MemNet (pre-Transformer) 2015, DrQA (pre-RAG) 2017, BlenderBot (dialog pre-ChatGPT) 2018+, Self-Rewarding + more!

Nov 7, 7 tweets

Scaling Agent Learning via Experience Synthesis
πŸ“: arxiv.org/abs/2511.03773

Scaling training environments for RL by simulating them with reasoning LLMs!

Environment models + Replay buffer + New tasks = cheap RL for any environment!

- Strong improvements on non-RL-ready environments and across multiple model families!
- Works better in sim-to-real RL settings → a warm start for high-cost environments
🧡1/7

🤖 Why is RL for language agents still so hard?

- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing

👉 Scaling RL for LLM agents is bottlenecked not by models, but by diverse environments and experience data
🧡2/7

🛠️ Recipe for DreamGym

1. Reasoning-Based Experience Model:
- Takes in (state, action, task) and predicts next state + reward
- Uses CoT to explain why the transition happens → more stable RL signals

2. Grounded Replay-Buffer:
- Each prediction is conditioned on similar reference trajectories
- Reduces hallucinations & keeps transitions causally consistent

3. Entropy-Based Task Curriculum:
- New tasks are automatically paired with reward signals in DreamGym!
- Finds tasks of intermediate difficulty for the current policy
- Generates variations of these tasks to challenge the agent → automatic curriculum learning

👉 Scalable “synthetic environment” that produces consistent, diverse, reward-dense experience for RL training (rough code sketches below)
🧡3/7
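
To make the recipe concrete, here is a minimal sketch of how steps 1-2 could fit together (my own illustration, not the paper's code): retrieve a few similar past transitions from a replay buffer, condition a reasoning LLM on them, and have it return a reasoning trace plus the predicted next state and reward. `llm_generate`, `embed`, and the prompt/JSON format below are assumptions.

```python
# Hypothetical sketch: a reasoning-based experience model grounded in a replay buffer.
# `embed` and `llm_generate` are placeholders for a real encoder and a reasoning LLM.
import json
import numpy as np


def embed(text: str) -> np.ndarray:
    """Toy text embedding (hash-seeded random vector); swap in a real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)


def llm_generate(prompt: str) -> str:
    """Placeholder for a reasoning-LLM call that returns the JSON described in the prompt."""
    raise NotImplementedError("plug in your LLM backend here")


class ReplayBuffer:
    """Stores past transitions and retrieves the most similar ones to ground new predictions."""

    def __init__(self):
        self.items, self.keys = [], []

    def add(self, task, state, action, next_state, reward):
        self.items.append(dict(task=task, state=state, action=action,
                               next_state=next_state, reward=reward))
        self.keys.append(embed(f"{task} {state} {action}"))

    def retrieve(self, task, state, action, k=3):
        if not self.items:
            return []
        q = embed(f"{task} {state} {action}")
        sims = [float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-8))
                for key in self.keys]
        return [self.items[i] for i in np.argsort(sims)[-k:]]


def predict_transition(task, state, action, buffer: ReplayBuffer):
    """Predict (reasoning, next_state, reward) for one step, grounded in similar past transitions."""
    refs = buffer.retrieve(task, state, action)
    prompt = (
        "You simulate an interactive environment.\n"
        f"Task: {task}\nCurrent state: {state}\nAgent action: {action}\n"
        "Reference transitions from similar situations:\n"
        + "\n".join(json.dumps(r) for r in refs)
        + "\nThink step by step about what happens, then answer with JSON: "
          '{"reasoning": "...", "next_state": "...", "reward": 0 or 1}'
    )
    out = json.loads(llm_generate(prompt))
    return out["reasoning"], out["next_state"], float(out["reward"])
```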
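
And a rough sketch of the entropy-based curriculum in step 3: estimate each candidate task's success rate under the current policy, keep the tasks whose outcome entropy is highest (success rate near 50%, i.e. intermediate difficulty), and generate variations of them. `run_policy` and `generate_task_variations` are hypothetical stand-ins for the policy rollout and the LLM task generator.

```python
# Hypothetical sketch: an entropy-based task curriculum for the agent being trained.
# `run_policy` (returns True on task success) and `generate_task_variations` are stand-ins.
import math
from typing import Callable, List


def outcome_entropy(success_rate: float) -> float:
    """Binary entropy of the task outcome; peaks at success_rate = 0.5 (intermediate difficulty)."""
    p = min(max(success_rate, 1e-6), 1.0 - 1e-6)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def select_curriculum_tasks(tasks: List[str], run_policy: Callable[[str], bool],
                            rollouts_per_task: int = 8, top_k: int = 4) -> List[str]:
    """Keep the tasks whose outcomes are most uncertain under the current policy."""
    scored = []
    for task in tasks:
        successes = sum(run_policy(task) for _ in range(rollouts_per_task))
        scored.append((outcome_entropy(successes / rollouts_per_task), task))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [task for _, task in scored[:top_k]]


def expand_curriculum(seed_tasks: List[str],
                      run_policy: Callable[[str], bool],
                      generate_task_variations: Callable[[str], List[str]]) -> List[str]:
    """One curriculum step: pick intermediate-difficulty tasks, then add variations of them."""
    frontier = select_curriculum_tasks(seed_tasks, run_policy)
    new_tasks: List[str] = []
    for task in frontier:
        new_tasks.extend(generate_task_variations(task))
    return frontier + new_tasks
```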

📊 Main Results
DreamGym beats or matches state-of-the-art RL without any real-environment rollouts.

- On non-RL-ready environments such as WebArena, DreamGym is the only scalable method that makes RL work, with a +30% gain over all baselines
- On RL-costly environments such as WebShop & ALFWorld, DreamGym matches PPO/GRPO trained with 80K real interactions, while using 0 real interactions
- With Sim-to-Real transfer, DreamGym-S2R gives +40% performance while using 10× less real data
- Works across different RL algorithms and model families → both algorithm- and backbone-agnostic
🧡4/7

🔥 Why synthetic experience matters: efficiency, transfer, and scaling

⏱️ 4× faster training than real-env RL → less rollout infra time and fewer GPU hours!
🌍 Synthetic training with WebShop tasks → the policy transfers to WebArena & ALFWorld!
📈 Smoother, faster learning curves; DreamGym-S2R converges both higher and faster!
🧡5/7

πŸ” What makes DreamGym work?

- Removing replay grounding → big drop in state consistency, more hallucination
- Removing reasoning traces → states become shallow, reward errors rise
- Removing task generation → the policy plateaus early

In short:
✅ Reasoning = stable transitions
✅ Replay buffer = factual grounding
✅ Entropy-based curriculum = continuous improvement

LLM agents don't just need more data; they need useful experience.
🧡6/7

🧩 Takeaway
The bottleneck in agent RL wasn't the policy; it was the experience.
DreamGym shows that when we replace heterogeneous real-world rollouts with reasoning-grounded synthetic experience, RL becomes:
✅ scalable & unified
✅ lighter on human engineering
✅ generalizable and transferable

If we can synthesize the right experiences, we can train agents before they ever touch the real world.

📄 Paper: arxiv.org/abs/2511.03773
🙌 Feedback & collaborations welcome
🧡7/7
