Jason Weston
Nov 7 • 7 tweets • 4 min read
Scaling Agent Learning via Experience Synthesis
πŸ“: arxiv.org/abs/2511.03773

Scaling training environments for RL by simulating them with reasoning LLMs!

Environment models + Replay-buffer + New tasks = cheap RL for any environments!

- Strong improvements on non-RL-ready environments and across multiple model families!
- Works even better in sim-to-real RL settings → warm-start for high-cost environments
🧡1/7
🤖 Why is RL for language agents still so hard?

- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing

👉 Scaling RL for LLM agents is bottlenecked not by models, but by diverse environments and experience data
🧡2/7
🛠️ Recipe for DreamGym

1. Reasoning-Based Experience Model:
- Takes in (state, action, task) and predicts next state + reward
- Uses CoT to explain why the transition happens → more stable RL signals

2. Grounded Replay-Buffer:
- Each prediction is conditioned on similar reference trajectories
- Reduces hallucinations & keeps transitions causally consistent

3. Entropy-Based Task Curriculum:
- New tasks are automatically paired with reward signals in DreamGym!
- Finds tasks of intermediate difficulty for the current policy
- Generates variations of these tasks to challenge the agent → automatic curriculum learning

👉 Scalable “synthetic environment” that produces consistent, diverse, reward-dense experience for RL training (sketched below)
🧡3/7
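A minimal sketch of how these three pieces could fit together, in Python. Everything here is illustrative: the callables (embed, experience_llm, success_rate_fn, vary_task) and the entropy heuristic are assumptions about the design, not the paper's actual code or API.

```python
# Illustrative DreamGym-style skeleton -- placeholder callables, not the paper's API.
import math

def retrieve_similar(replay_buffer, query_text, embed, k=3):
    """Grounded replay buffer: fetch the k stored transitions most similar to the query."""
    qv = embed(query_text)
    score = lambda tr: sum(a * b for a, b in zip(embed(tr["text"]), qv))
    return sorted(replay_buffer, key=score, reverse=True)[:k]

def synthetic_step(state, action, task, replay_buffer, embed, experience_llm):
    """Reasoning-based experience model: CoT rationale -> (next_state, reward),
    conditioned on similar reference trajectories to reduce hallucination."""
    refs = retrieve_similar(replay_buffer, state + " " + action, embed)
    rationale, next_state, reward = experience_llm(task, state, action, refs)
    replay_buffer.append({"text": " ".join([state, action, next_state])})
    return next_state, reward

def task_entropy(success_rate):
    """Entropy of the policy's success rate: peaks at 0.5, i.e. intermediate difficulty."""
    p = min(max(success_rate, 1e-6), 1 - 1e-6)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def next_curriculum(task_pool, success_rate_fn, vary_task, top_k=16):
    """Keep the highest-entropy tasks for the current policy and generate variations."""
    ranked = sorted(task_pool, key=lambda t: task_entropy(success_rate_fn(t)), reverse=True)
    return [vary_task(t) for t in ranked[:top_k]]
```

The agent then runs standard RL (e.g., PPO or GRPO) against synthetic_step instead of a live browser or OS.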
📊 Main Results
DreamGym beats or matches state-of-the-art RL — without any real-environment rollouts.

- On non-RL-ready environments such as WebArena, DreamGym is the only scalable method that makes RL work, +30%↑ over all baselines
- On RL-costly environments such as WebShop & ALFWorld, DreamGym matches PPO / GRPO trained with 80K real interactions — using 0 real interactions
- With Sim-to-Real transfer, DreamGym-S2R gives +40% performance while using 10× less real data
- Works across different RL algorithms and model families → both algorithm- and backbone-agnostic
🧡4/7
🔥 Why synthetic experience matters: efficiency, transfer, and scaling

⏱️ 4× faster training than real-env RL → reduces infra rollout time and GPU hours!
🌍 Synthetic training on WebShop tasks → policy transfers to WebArena & ALFWorld!
📈 Smoother, faster learning curves; DreamGym-S2R converges higher and faster!
🧡5/7
🔍 What makes DreamGym work?

- Removing replay grounding → big drop in state consistency, higher hallucination
- Removing reasoning traces → states become shallow, reward errors rise
- Removing task generation → policy plateaus early

In short:
✅ Reasoning = stable transitions
✅ Replay buffer = factual grounding
✅ Entropy-based curriculum = continuous improvement

LLM agents don't just need more data — they need useful experience.
🧡6/7
🧩 Takeaway
The bottleneck in agent RL wasn't the policy — it was the experience.
DreamGym shows that when we replace heterogeneous real-world rollouts with reasoning-grounded synthetic experience, RL becomes:
✅ scalable & unified
✅ lighter on human engineering
✅ generalizable and transferable

If we can synthesize the right experiences, we can train agents before they ever touch the real world.

📄 Paper: arxiv.org/abs/2511.03773
🙌 Feedback & collaborations welcome
🧡7/7

More from @jaseweston

Oct 17
🌀Agent Learning via Early Experience🌀
📝: arxiv.org/abs/2510.08558
- SFT for agents is sparse; RL on long horizons is hard
We provide new mid-training signals that work:
1) Implicit next state world modeling task
2) Self-reflection on alternate states
- Strong improvements over 8 environments and multiple model families
- Works well for subsequent RL!
🧡1/5
Recipe 👨‍🍳:

1) Implicit world modeling
Augments expert trajectories with alternative actions and predicted next states, training the policy to internalize transition dynamics before deployment.

2) Self-reflection
Augments expert actions with self-generated explanations, training the policy to reason about and revise its own decisions.

Both methods use K alternative actions proposed by the initial policy (LLM) — sketch below.
🧡2/5
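A rough sketch of how the two augmentations could be built as SFT data, assuming placeholder callables (propose_actions, predict_next_state, explain_choice) for the LLM calls; this is an illustration, not the paper's actual pipeline.

```python
# Illustrative sketch of the two mid-training augmentations (placeholder LLM callables).
def implicit_world_modeling_data(expert_traj, propose_actions, predict_next_state, k=4):
    """For each expert step, sample K alternative actions from the initial policy and
    train on predicting the resulting next state (internalize transition dynamics)."""
    examples = []
    for state, expert_action in expert_traj:
        for alt_action in propose_actions(state, k):
            examples.append({"input": (state, alt_action),
                             "target": predict_next_state(state, alt_action)})
    return examples

def self_reflection_data(expert_traj, propose_actions, explain_choice, k=4):
    """Pair each expert action with K alternatives plus a self-generated explanation of
    why the expert action is preferable (learn to reason about its own decisions)."""
    examples = []
    for state, expert_action in expert_traj:
        alts = propose_actions(state, k)
        rationale = explain_choice(state, expert_action, alts)
        examples.append({"input": (state, alts),
                         "target": (rationale, expert_action)})
    return examples
```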
Results: across 8 benchmarks we see performance improvements over instruction-tuned model prompting or imitation learning for the task.
🧡3/5
Oct 13
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♂️ 💪
📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward models → better results!

✔️ Stratified normalization anchors dense scores within verifier groups
✔️ Variance-aware weighting emphasizes harder, high-variance prompts
✔️ Stable + informative rewards, no drift

📈 Results:
🔥 +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
🔥 Generalizes across Qwen and OctoThinker models
🔥 Works well when training with easy-to-verify / hard-to-verify / mixed samples.

Hybrid reward → stable, dense, reliable supervision, advancing reasoning RL

🧡(1/5)
Motivation & analysis🎯:

(a) Rule-based Reward: precise but brittle — gives 0 to many almost-correct answers.
✅ correctness but too strict.

(b) Reward Model: smooth but misaligned — sometimes rewards wrong answers (false positives) or underrates correct ones (false negatives).
✅ coverage but too loose.

Neither sparse nor dense alone is enough (see table).

🧡(2/5)
HERO: merges both — keeps verifier precision ✅ and adds RM nuance 💡 → fewer false negatives, denser supervision, stronger gradients.

Reward Design ⚖️ + Training Recipe 👨‍🍳: making sparse signals dense (and reliable).

1️⃣ Stratified normalization - rescale reward-model scores within verifier groups.
→ preserves correctness while adding fine-grained gradients.

2️⃣ Variance-aware weighting - boost diverse rollouts, down-weight trivial ones.
→ focuses learning where signals are most informative.

Dense feedback introduces reward differences even in all-0/all-1 verifier batches → no gradient dead zones (sketch below).

🧡(3/5)
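A hedged sketch of how stratified normalization and variance-aware weighting might look for one prompt's group of rollouts; the target bands, margin, and floor constants are my assumptions, not the paper's exact formulas.

```python
# Hedged sketch of HERO-style hybrid rewards for one prompt's rollouts (assumed constants).
import numpy as np

def stratified_normalize(verifier_labels, rm_scores, margin=0.3):
    """Rescale reward-model scores *within* each verifier group (0s and 1s separately),
    so dense ordering is added without flipping correct vs. incorrect rollouts."""
    v = np.asarray(verifier_labels, dtype=float)
    s = np.asarray(rm_scores, dtype=float)
    out = np.empty_like(s)
    bands = {0.0: (0.0, margin), 1.0: (1.0 - margin, 1.0)}   # assumed target bands
    for label, (lo, hi) in bands.items():
        idx = v == label
        if idx.any():
            g = s[idx]
            rng = g.max() - g.min()
            norm = (g - g.min()) / rng if rng > 0 else np.full_like(g, 0.5)
            out[idx] = lo + norm * (hi - lo)
    return out

def variance_weight(verifier_labels, floor=0.5):
    """Up-weight prompts whose rollouts disagree (informative), down-weight all-0 / all-1
    prompts; weight lies in [floor, 1] and peaks when the success rate is 0.5."""
    p = float(np.mean(verifier_labels))
    return floor + (1.0 - floor) * 4.0 * p * (1.0 - p)
```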
Jun 30
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also — schedule sketched below.
- Offline DPO is way behind.
- Combining verifiable + non-verifiable works! Cross-transfer gains.
- Recipes for how to make this work.
🧡1/4
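A minimal sketch of the semi-online schedule, assuming hypothetical helpers (clone, make_preference_batch, dpo_update). Setting s = 1 recovers fully online DPO; never re-syncing recovers offline DPO.

```python
# Minimal sketch of the semi-online (iterative) DPO schedule (hypothetical helpers).
def semi_online_dpo(policy, make_preference_batch, dpo_update, num_steps, s):
    """Sample preference pairs from a snapshot that is re-synced every s steps."""
    gen_policy = policy.clone()                    # frozen snapshot used for generation
    for step in range(num_steps):
        if step % s == 0:
            gen_policy = policy.clone()            # periodic sync -> "semi-online"
        batch = make_preference_batch(gen_policy)  # generate pairs, label chosen/rejected
        policy = dpo_update(policy, batch)         # standard DPO gradient step
    return policy
```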
- Online DPO results in a 59.4% increase in AlpacaEval LC winrate & 56.2% in ArenaHard score compared to standard DPO. DPO is poor due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But, more surprisingly, so does semi-online DPO.
🧡2/4
We find similar results on verifiable tasks as well.
Semi-online DPO even performs a little bit better on some tasks.
🧡3/4
Jul 30, 2024
🚨New paper!🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs

🧡(1/6) arxiv.org/abs/2407.19594
Recipe 👩‍🍳:
Iterate 3 steps:
(1) Create Actor data: generate responses & self-rewards (judgments) with LM
(2) Create Judge data: generate meta-rewards over judgments with LLM-as-a-Meta-Judge
(3) Train DPO on preference pairs to both learn to act (1) AND to judge (2)
🧡(2/6)
How does an LLM judge judgments? We use LLM-as-a-Meta-Judge, see prompt in figure.
- Make N judgments for a given pair of responses & calc pairwise meta-judgments
- Compute an Elo score for each judgment from this pairwise matrix
- Create LLM-as-a-judge preference pairs via the Elo scores — sketch below
🧡(3/6)
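A sketch of one way to turn the N×N pairwise meta-judgment matrix into per-judgment Elo scores; the update constants and iteration scheme are illustrative, not necessarily the paper's exact procedure.

```python
# Illustrative Elo scoring of N judgments from pairwise meta-judgments (assumed constants).
def elo_from_pairwise(win_matrix, k=16.0, base=1000.0, rounds=10):
    """win_matrix[i][j] = 1.0 if judgment i beat judgment j under the meta-judge, else 0.0."""
    n = len(win_matrix)
    ratings = [base] * n
    for _ in range(rounds):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                expected = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
                ratings[i] += k * (win_matrix[i][j] - expected)
    return ratings  # top- vs. bottom-rated judgments become judge-training preference pairs
```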
Aug 14, 2023
🚨New Paper 🚨
Self-Alignment with Instruction Backtranslation

- New method auto-labels web text with instructions & curates high quality ones for FTing

- Our model Humpback 🐋 outperforms LIMA, Claude, Guanaco, davinci-003 & Falcon-Inst


(1/4)🧡 arxiv.org/abs/2308.06259
Recipe 👩‍🍳: LLM finetuned on small seed data; access to web docs
(1) Self-augment: label each web doc with an instruction via the LLM
(2) Self-curate: label each new example with a quality score via the LLM
Then FT with the newly curated data (sketch below).
Optionally Iterate.

(2/4) 🧡
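A compact sketch of one self-augment / self-curate iteration, with placeholder callables for the LLM calls and an assumed cutoff on the model's own 1–5 quality rating.

```python
# Sketch of one Humpback-style iteration (placeholder LLM callables, assumed 1-5 cutoff).
def backtranslation_iteration(web_docs, generate_instruction, score_quality, cutoff=5):
    """Self-augment: label each web doc with an instruction it could answer.
       Self-curate: keep only pairs the model itself rates as top quality."""
    candidates = [(generate_instruction(doc), doc) for doc in web_docs]
    curated = [(inst, doc) for inst, doc in candidates
               if score_quality(inst, doc) >= cutoff]
    return curated  # finetune on curated pairs, then optionally iterate with the new model
```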
The resulting data is remarkably high quality/impactful for training, even though it's through self-alignment, outperforming other instruction tuning datasets for the same data size (🐋 > 🐪)

(3/4) 🧡
