Scaling training environments for RL by simulating them with reasoning LLMs!
Environment models + replay buffer + new tasks = cheap RL for any environment!
- Strong improvements on non-RL-ready environments and across multiple model families!
- Works even better in sim-to-real RL settings → a warm start for high-cost environments
🧵 1/7
Why is RL for language agents still so hard?
- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing
Scaling RL for LLM agents is bottlenecked not by models, but by access to diverse environments and experience data
🧵 2/7
🛠️ Recipe for DreamGym
1. Reasoning-Based Experience Model:
- Takes in (state, action, task) and predicts next state + reward
- Uses CoT to explain why the transition happens → more stable RL signals
2. Grounded Replay Buffer:
- Each prediction is conditioned on similar reference trajectories
- Reduces hallucinations & keeps transitions causally consistent
3. Entropy-Based Task Curriculum:
- New tasks are automatically paired with reward signals in DreamGym!
- Finds tasks that are intermediate difficulty for the current policy
- Generates variations of these tasks to challenge the agent → automatic curriculum learning
A scalable "synthetic environment" that produces consistent, diverse, reward-dense experience for RL training. A rough sketch of these pieces is below.
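To make this concrete, here's a minimal Python sketch of one synthetic rollout step plus entropy-style task selection. Everything here is illustrative: the `llm` callable, the word-overlap retrieval, the JSON prompt format, and the 0/1 reward are assumptions for the sketch, not DreamGym's actual implementation.

```python
import json
from typing import Callable

def retrieve_similar(buffer: list[dict], state: str, action: str, k: int = 3) -> list[dict]:
    """Toy stand-in for replay-buffer grounding: rank stored transitions by word overlap."""
    query = set(f"{state} {action}".lower().split())
    return sorted(buffer, key=lambda t: -len(query & set(t["text"].lower().split())))[:k]

def synthetic_step(llm: Callable[[str], str], buffer: list[dict],
                   task: str, state: str, action: str) -> tuple[str, float, str]:
    """One experience-model step: (task, state, action) -> (next_state, reward, rationale)."""
    refs = "\n".join(t["text"] for t in retrieve_similar(buffer, state, action))
    prompt = (
        f"Task: {task}\nState: {state}\nAction: {action}\n"
        f"Reference transitions:\n{refs}\n"
        "Reason step by step about what this action causes, then answer with JSON "
        '{"rationale": "...", "next_state": "...", "reward": 0 or 1}.'
    )
    out = json.loads(llm(prompt))  # the reasoning LLM plays the environment
    return out["next_state"], float(out["reward"]), out["rationale"]

def pick_training_tasks(success_rate: dict[str, float], n: int = 8) -> list[str]:
    """Entropy-style curriculum: prefer tasks whose success rate is closest to 0.5
    (maximally uncertain for the current policy); variations of these are generated next."""
    return sorted(success_rate, key=lambda t: abs(success_rate[t] - 0.5))[:n]
```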
🧵 3/7
Main Results
DreamGym beats or matches state-of-the-art RL, without any real-environment rollouts.
- On non-RL-ready environments such as WebArena, DreamGym is the only scalable method that makes RL work, with a +30% gain over all baselines
- On RL-costly environments such as WebShop & ALFWorld, DreamGym matches PPO/GRPO trained with 80K real interactions while using 0 real interactions
- With sim-to-real transfer, DreamGym-S2R gives +40% performance while using 10× less real data
- Works across different RL algorithms and model families → both algorithm- and backbone-agnostic
🧵 4/7
Why synthetic experience matters: efficiency, transfer, and scaling
- 4× faster training than real-env RL → less infra rollout time and fewer GPU hours!
- Synthetic training on WebShop tasks → the policy transfers to WebArena & ALFWorld!
- Smoother, faster learning curves; DreamGym-S2R converges highest and fastest!
🧵 5/7
What makes DreamGym work?
- Removing replay grounding → big drop in state consistency, more hallucination
- Removing reasoning traces → states become shallow, reward errors rise
- Removing task generation → the policy plateaus early
Agent Learning via Early Experience
Paper: arxiv.org/abs/2510.08558
- SFT data for agents is sparse; RL over long horizons is hard
- We provide new mid-training signals that work: 1) an implicit next-state world-modeling task, 2) self-reflection on alternate states
- Strong improvements across 8 environments and multiple model families
- Works well for subsequent RL!
🧵 1/5
Recipe 👨‍🍳:
1) Implicit world modeling
Augments expert trajectories with alternative actions and predicted next states, training the policy to internalize transition dynamics before deployment.
2) Self-reflection
Augments expert actions with self-generated explanations, training the policy to reason about and revise its own decisions.
Both methods use K alternative actions proposed by the initial policy (LLM); a rough sketch of the data construction is below.
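A rough sketch of how the two augmentation datasets could be built from expert trajectories. The helper callables (`policy`, `next_state_fn`, `reflect`) and the prompt formats are assumptions for illustration, not the paper's exact setup.

```python
from typing import Callable

def build_early_experience(
    policy: Callable[[str, int], list[str]],    # proposes K alternative actions for a state
    next_state_fn: Callable[[str, str], str],   # returns the state that follows an action
    reflect: Callable[[str, str, str], str],    # explains why the expert action beats an alternative
    expert_traj: list[dict],                    # [{"state": ..., "action": ...}, ...]
    k: int = 4,
):
    """Turn expert trajectories into the two mid-training datasets described above."""
    world_model_data, reflection_data = [], []
    for step in expert_traj:
        s, a_expert = step["state"], step["action"]
        for a_alt in policy(s, k):              # K alternative actions from the initial policy
            s_next = next_state_fn(s, a_alt)
            # (1) implicit world modeling: internalize transition dynamics
            world_model_data.append({
                "prompt": f"State: {s}\nAction: {a_alt}\nPredict the next state:",
                "target": s_next,
            })
            # (2) self-reflection: reason about and revise the decision
            reflection_data.append({
                "prompt": f"State: {s}\nCandidate action: {a_alt}\nWhich action is best and why?",
                "target": reflect(s, a_alt, a_expert) + f"\nChosen action: {a_expert}",
            })
    return world_model_data, reflection_data
```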
🧵 2/5
Results: across 8 benchmarks we see performance improvements over prompting the instruction-tuned model and over imitation learning on the task.
🧵 3/5
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♀️💪
Paper: arxiv.org/abs/2510.07242
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward models → better results!
✔️ Stratified normalization anchors dense scores within verifier groups
✔️ Variance-aware weighting emphasizes harder, high-variance prompts
✔️ Stable + informative rewards, no drift (rough sketch below)
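Here's a rough numerical sketch of what a hybrid reward of this shape could look like for one prompt's rollouts. The band width, min-max normalization, and variance-based weight are illustrative choices, not HERO's actual equations.

```python
import numpy as np

def hybrid_rewards(verifier: np.ndarray, rm_scores: np.ndarray,
                   band: float = 0.5) -> np.ndarray:
    """Blend binary verifier labels with dense RM scores for one prompt's rollouts."""
    out = np.empty_like(rm_scores, dtype=float)
    for v in (0, 1):                              # stratify by verifier group
        mask = verifier == v
        if not mask.any():
            continue
        s = rm_scores[mask]
        rng = s.max() - s.min()
        norm = (s - s.min()) / rng if rng > 0 else np.full_like(s, 0.5)
        out[mask] = v + band * (norm - 0.5)       # dense scores stay anchored to their group
    return out

def prompt_weight(verifier: np.ndarray) -> float:
    """Variance-aware weighting: prompts whose rollouts disagree get more weight."""
    return float(verifier.var())                  # peaks at 0.25 when half the rollouts pass

# Example: 4 rollouts of one prompt
v = np.array([1, 0, 1, 0])
rm = np.array([0.9, 0.7, 0.4, 0.1])
print(hybrid_rewards(v, rm), prompt_weight(v))    # verified rollouts always outrank unverified ones
```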
Results:
- +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
- Generalizes across Qwen and OctoThinker models
- Works well when training on easy-to-verify, hard-to-verify, and mixed samples.
(a) Rule-based Reward: precise but brittle → gives 0 to many almost-correct answers.
→ correctness, but too strict.
(b) Reward Model: smooth but misaligned → sometimes rewards wrong answers (false positives) or underrates correct ones (false negatives).
→ coverage, but too loose.
Neither sparse nor dense alone is enough (see table).
Bridging Offline & Online RL for LLMs
Paper: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO, syncing every s steps (more efficient!), also works very well.
- Offline DPO is way behind.
- Combining verifiable + non-verifiable works! Cross-transfer gains.
- Recipes for how to make this work (semi-online loop sketched below).
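A minimal sketch of the semi-online loop. The helpers `generate_pairs`, `dpo_update`, and `policy.copy()` are hypothetical stand-ins, not names from the paper.

```python
def semi_online_dpo(policy, ref_model, prompt_batches, s, generate_pairs, dpo_update):
    """Semi-online (iterative) DPO sketch: rollouts come from a snapshot of the policy
    that is re-synced only every s optimizer steps."""
    generator = None
    for step, batch in enumerate(prompt_batches):
        if step % s == 0:                          # sync point: refresh the rollout model
            generator = policy.copy()              # assumed: returns a frozen weight snapshot
        pairs = generate_pairs(generator, batch)   # (chosen, rejected) pairs from possibly stale samples
        dpo_update(policy, ref_model, pairs)       # standard DPO step on the live policy
    return policy
```

With s = 1 every batch is generated on-policy (online DPO); letting s grow toward the whole dataset recovers the offline setting.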
🧵 1/4
- Online DPO gives a 59.4% increase in AlpacaEval LC winrate & a 56.2% increase in ArenaHard score over standard DPO, which lags due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But, more surprisingly, so does semi-online DPO.
🧵 2/4
We find similar results on verifiable tasks as well.
Semi-online DPO even performs a little bit better on some tasks.
🧵 3/4
🚨 New paper! 🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs
How does an LLM judge judgments? We use LLM-as-a-Meta-Judge, see prompt in figure.
- Make N judgments for a given pair of responses & compute pairwise meta-judgments
- Compute Elo scores for the judgments from this pairwise matrix
- Create LLM-as-a-judge preference pairs via the Elo scores (sketch below)
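A small sketch of the Elo step, assuming the meta-judge's pairwise outcomes are already collected into an N×N win-count matrix; this simple iterative Elo update is illustrative and not necessarily the paper's exact scheme.

```python
import numpy as np

def elo_from_wins(wins: np.ndarray, k: float = 16.0, rounds: int = 100) -> np.ndarray:
    """Elo ratings from an N x N matrix where wins[i, j] = # times judgment i beat j
    according to the meta-judge."""
    n = wins.shape[0]
    ratings = np.zeros(n)
    for _ in range(rounds):
        for i in range(n):
            for j in range(n):
                for _win in range(int(wins[i, j])):   # replay each recorded win of i over j
                    expected = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
                    ratings[i] += k * (1.0 - expected)
                    ratings[j] -= k * (1.0 - expected)
    return ratings

# N judgments of the same response pair -> pick chosen/rejected judgments for training the judge
wins = np.array([[0, 2, 3],
                 [1, 0, 2],
                 [0, 1, 0]])
scores = elo_from_wins(wins)
chosen, rejected = int(scores.argmax()), int(scores.argmin())
```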
🧵 (3/6)
The resulting data is remarkably high quality and impactful for training, even though it is produced through self-alignment: it outperforms other instruction-tuning datasets at the same data size.