Jason Weston
@Meta+NYU. NLP from scratch (Pretrain+FT LLM) 2008, MemNet (pre-Transformer) 2015, DrQA (pre-RAG) 2017, BlenderBot (dialog pre-ChatGPT) 2018+, Self-Rewarding + more!
Nov 7 • 7 tweets • 4 min read
Scaling Agent Learning via Experience Synthesis
📝: arxiv.org/abs/2511.03773

Scaling training environments for RL by simulating them with reasoning LLMs!

Environment models + Replay buffer + New tasks = cheap RL for any environment!

- Strong improvements on non-RL-ready environments and across multiple model families!
- Works better in sim-2-real RL settings → warm-start for high-cost environments
🧵1/7

🤖 Why is RL for language agents still so hard?

- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing

👉 Scaling RL for LLM agents is bottlenecked not by models, but by diverse environments and experience data
🧵2/7
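The loop described above can be pictured with a short sketch. This is a minimal reading of the experience-synthesis idea, not the paper's code: an LLM "environment model" reasons out next observations and rewards, proposed task variations keep the task set from going stale, and rollouts land in a replay buffer for the RL learner. The `llm` and `policy` callables and the prompt formats are illustrative assumptions.

```python
import json

def synth_step(llm, task, history, action):
    """Ask the LLM environment model for the next observation, reward, and done flag."""
    prompt = (
        "You simulate an interactive environment. Reason step by step, then\n"
        'answer as JSON: {"observation": "...", "reward": 0 or 1, "done": true or false}\n'
        f"Task: {task}\nHistory: {json.dumps(history)}\nAgent action: {action}"
    )
    return json.loads(llm(prompt))

def propose_tasks(llm, seed_tasks, k=5):
    """Generate new task variations so the agent doesn't overfit a static task set."""
    prompt = f"Write {k} new variations (one per line) of these tasks: " + "; ".join(seed_tasks)
    return llm(prompt).splitlines()[:k]

def collect_synthetic_episode(llm, policy, task, replay_buffer, max_steps=8):
    """Roll out the policy against the simulated environment and store the
    trajectory in a replay buffer for the RL learner."""
    history = []
    for _ in range(max_steps):
        action = policy(task, history)
        step = synth_step(llm, task, history, action)
        history.append({"action": action, **step})
        if step["done"]:
            break
    replay_buffer.append({"task": task, "trajectory": history})
```

In a sim-2-real setup, the policy would first be trained on this synthetic experience and then fine-tuned in the real, high-cost environment.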
Oct 17 • 5 tweets • 3 min read
🌀Agent Learning via Early Experience🌀
📝: arxiv.org/abs/2510.08558
- SFT for agents is sparse; RL on long horizons is hard
We provide new mid-training signals that work:
1) Implicit next state world modeling task
2) Self-reflection on alternate states
- Strong improvements across 8 environments and multiple model families
- Works well for subsequent RL!
🧵1/5

Recipe 👨‍🍳:

1) Implicit world modeling
Augments expert trajectories with alternative actions and predicted next states, training the policy to internalize transition dynamics before deployment.

2) Self-reflection
Augments expert actions with self-generated explanations, training the policy to reason about and revise its own decisions.

Both methods use K alternative actions proposed by the initial policy (LLM).
🧵2/5
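A minimal sketch of how the two mid-training datasets might be built from expert trajectories, assuming `policy_sample` proposes the K alternative actions, `env_step` returns the observed next state, and `llm_reflect` produces the self-reflection text; the input/target formats are illustrative, not the paper's exact ones.

```python
def build_world_modeling_data(trajectories, policy_sample, env_step, k=3):
    """(1) Implicit world modeling: branch off each expert state with K
    alternative actions and use the observed next states as prediction targets."""
    data = []
    for traj in trajectories:                     # traj: list of (state, expert_action)
        for state, _expert_action in traj:
            for alt_action in policy_sample(state, k):
                next_state = env_step(state, alt_action)
                data.append({
                    "input": f"State: {state}\nAction: {alt_action}\nPredict the next state.",
                    "target": next_state,
                })
    return data

def build_self_reflection_data(trajectories, policy_sample, llm_reflect, k=3):
    """(2) Self-reflection: contrast the expert action with K alternatives and
    train on a self-generated explanation followed by the expert action."""
    data = []
    for traj in trajectories:
        for state, expert_action in traj:
            alts = policy_sample(state, k)
            explanation = llm_reflect(state, expert_action, alts)
            data.append({
                "input": f"State: {state}\nCandidate actions: {alts + [expert_action]}",
                "target": f"{explanation}\nChosen action: {expert_action}",
            })
    return data
```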
Oct 13 • 5 tweets • 3 min read
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♂️ 💪
📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward models → better results!

✔️ Stratified normalization anchors dense scores within verifier groups
✔️ Variance-aware weighting emphasizes harder, high-variance prompts
✔️ Stable + informative rewards, no drift

📈 Results:
🔥 +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
🔥 Generalizes across Qwen and OctoThinker models
🔥 Works well when training with easy-to-verify/hard-to-verify/mixed samples.

Hybrid reward → stable, dense, reliable supervision, advancing reasoning RL

🧵(1/5)

Motivation & analysis 🎯:

(a) Rule-based Reward: precise but brittle; gives 0 to many almost-correct answers.
✅ correctness but too strict.

(b) Reward Model: smooth but misaligned; sometimes rewards wrong answers (false positives) or underrates correct ones (false negatives).
✅ coverage but too loose.

Neither sparse nor dense alone is enough (see table).

🧵(2/5)
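A rough sketch of a hybrid reward in the spirit of HERO; the exact normalization and weighting in the paper may differ, and `alpha`, `eps`, and the weight range are assumed knobs. Per prompt, each rollout has a 0/1 verifier label and a dense reward-model score; dense scores are min-max normalized within each verifier group so they only re-rank responses sharing the same verifier outcome, and prompts whose rollouts disagree get a larger weight.

```python
import numpy as np

def hybrid_rewards(verifier, rm_scores, alpha=0.5, eps=1e-6):
    """Stratified normalization (sketch): anchor dense RM scores inside each
    verifier group so a verified answer always outranks an unverified one."""
    verifier = np.asarray(verifier, dtype=float)    # 0/1 verifier labels per rollout
    rm_scores = np.asarray(rm_scores, dtype=float)  # dense reward-model scores
    rewards = np.empty_like(rm_scores)
    for v in (0.0, 1.0):
        mask = verifier == v
        if mask.any():
            group = rm_scores[mask]
            # min-max normalize within the group: dense scores only re-rank
            # responses that share the same verifier outcome
            norm = (group - group.min()) / (group.max() - group.min() + eps)
            rewards[mask] = v + alpha * (norm - 0.5)  # centered around 0 or 1
    return rewards

def prompt_weight(verifier, w_min=0.5, w_max=1.0):
    """Variance-aware weighting (sketch): upweight prompts whose rollouts
    disagree, i.e., the harder, more informative ones."""
    var = np.var(np.asarray(verifier, dtype=float))   # Bernoulli variance, max 0.25
    return w_min + (w_max - w_min) * (var / 0.25)
```

With alpha = 0.5, verified rollouts fall in [0.75, 1.25] and unverified ones in [-0.25, 0.25], so the dense model only breaks ties inside each group rather than overriding the verifier.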
Jun 30 • 4 tweets • 2 min read
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with a sync every s steps (more efficient!) also works very well.
- Offline DPO is way behind.
- Combining verifiable + non-verifiable works! Cross-transfer gains.
- Recipes for how to make this work.
🧵1/4

- Online DPO results in a 59.4% increase in AlpacaEval LC winrate & 56.2% in ArenaHard score compared to standard DPO, which is poor due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But more surprisingly so does semi-online DPO.
🧵2/4
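The semi-online schedule can be sketched as follows (assumptions, not the released training code): responses are sampled from a frozen snapshot of the policy, preferences come from a verifier or reward model via `annotate`, `dpo_update` performs one DPO step, and the snapshot is re-synced every s steps; `policy.clone()` and `gen_model.sample()` are assumed model interfaces.

```python
def semi_online_dpo(policy, ref_model, prompt_batches, annotate, dpo_update, s=32):
    """Sketch of semi-online (iterative) DPO.
    prompt_batches: iterable of prompt batches.
    annotate(batch, responses): returns (chosen, rejected) preference pairs,
        from a verifier on verifiable tasks or a reward model otherwise.
    dpo_update(policy, ref_model, chosen, rejected): one DPO gradient step."""
    gen_model = policy.clone()                # frozen snapshot used for sampling
    for step, batch in enumerate(prompt_batches, start=1):
        responses = gen_model.sample(batch)   # sampled from the snapshot,
                                              # not the latest policy weights
        chosen, rejected = annotate(batch, responses)
        dpo_update(policy, ref_model, chosen, rejected)
        if step % s == 0:                     # s = 1 recovers fully online DPO;
            gen_model = policy.clone()        # never syncing recovers offline DPO
    return policy
```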
Jul 30, 2024 • 7 tweets • 3 min read
🚨New paper!🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs

🧵(1/6) arxiv.org/abs/2407.19594
Recipe 👩‍🍳:
Iterate 3 steps:
(1) Create Actor data: generate responses & self-rewards (judgments) with LM
(2) Create Judge data: generate meta-rewards over judgments with LLM-as-a-Meta-Judge
(3) Train DPO on preference pairs to both learn to act (1) AND to judge (2)
🧵(2/6)
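One iteration of the recipe might look like the sketch below; the pairing rules and scoring details only loosely follow the paper. A single `model` plays actor, judge, and meta-judge, with `model.generate`, `model.judge`, `model.meta_judge`, and `dpo_train` as assumed interfaces (`judge` returning a (score, rationale) pair).

```python
def meta_rewarding_iteration(model, prompts, dpo_train, n_responses=4):
    """One iteration (sketch): build actor and judge preference pairs, then
    run DPO on both so the model learns to act AND to judge."""
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        # (1) Actor data: sample responses and self-judge each one
        responses = [model.generate(prompt) for _ in range(n_responses)]
        judgments = [model.judge(prompt, r) for r in responses]   # (score, rationale)
        ranked = sorted(zip(judgments, responses), key=lambda x: x[0][0])
        actor_pairs.append((prompt, ranked[-1][1], ranked[0][1]))  # best vs worst response

        # (2) Judge data: meta-judge two alternative judgments of the same response
        r = responses[0]
        j_a, j_b = model.judge(prompt, r), model.judge(prompt, r)
        winner = model.meta_judge(prompt, r, j_a, j_b)   # which judgment is better?
        loser = j_b if winner is j_a else j_a
        judge_pairs.append(((prompt, r), winner, loser))

    # (3) Preference optimization on both kinds of pairs
    dpo_train(model, actor_pairs + judge_pairs)
    return model
```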
Aug 14, 2023 • 4 tweets • 2 min read
🚨New Paper 🚨
Self-Alignment with Instruction Backtranslation

- New method auto-labels web text with instructions & curates high-quality ones for FTing

- Our model Humpback 🐋 outperforms LIMA, Claude, Guanaco, davinci-003 & Falcon-Inst


(1/4)🧵 arxiv.org/abs/2308.06259
Recipe 👩‍🍳: LLM finetuned on small seed data; access to web docs
(1) Self-augment: label each web doc with an instruction via the LLM
(2) Self-curate: label each new example with a quality score via the LLM
Then FT with the newly curated data.
Optionally Iterate.

(2/4) 🧵
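A minimal sketch of the two self-labeling steps, assuming `seed_model` is the LLM finetuned on the small seed set and `web_docs` is a list of unlabeled web passages; the prompts and the keep-only-top-scores threshold are illustrative.

```python
def self_augment(seed_model, web_docs):
    """Step (1): predict the instruction each web doc would be a good answer to."""
    pairs = []
    for doc in web_docs:
        instruction = seed_model.generate(
            "Write the instruction that this text is a good answer to:\n" + doc
        )
        pairs.append({"instruction": instruction, "response": doc})
    return pairs

def self_curate(seed_model, pairs, threshold=5):
    """Step (2): score each candidate pair with the same LLM and keep the best."""
    curated = []
    for ex in pairs:
        reply = seed_model.generate(
            "Rate from 1 to 5 how well the response answers the instruction. "
            "Answer with a single digit.\n"
            f"Instruction: {ex['instruction']}\nResponse: {ex['response']}\nScore:"
        )
        if int(reply.strip()[0]) >= threshold:   # assumes the reply starts with a digit
            curated.append(ex)
    return curated

# Then finetune on the seed data plus the curated pairs, and optionally
# iterate with the improved model as the new labeler and curator.
```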