Jason Weston
@Meta+NYU. NLP from scratch (Pretrain+FT LLM) 2008, MemNet (pre-Transformer) 2015, DrQA (pre-RAG) 2017, BlenderBot (dialog pre-ChatGPT) 2018+, Self-Rewarding + more!
Nov 7 • 7 tweets • 4 min read
Scaling Agent Learning via Experience Synthesis
📝: arxiv.org/abs/2511.03773

Scaling training environments for RL by simulating them with reasoning LLMs!

Environment models + Replay buffer + New tasks = cheap RL for any environment!

- Strong improvements on non-RL-ready environments and across multiple model families!
- Works better in sim-2-real RL settings → warm-start for high-cost environments
🧵1/7

🤖 Why is RL for language agents still so hard?

- Real environments are slow, fragile, and expensive - every rollout requires servers, resets, and long horizons
- Tasks are static & limited, so agents quickly overfit instead of learning to generalize
- Rewards are often sparse, noisy, or wrong (e.g., web pages change, UI breaks)
- RL Infra is painful: Docker, VMs, flaky APIs, no parallelism, no action sandboxing

👉 Scaling RL for LLM agents is bottlenecked not by models, but by diverse environments and experience data
🧵2/7
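The loop described above can be pictured with a short sketch. This is a minimal reading of the experience-synthesis idea, not the paper's code: an LLM "environment model" reasons out next observations and rewards, proposed task variations keep the task set from going stale, and rollouts land in a replay buffer for the RL learner. The `llm` and `policy` callables and the prompt formats are illustrative assumptions.

```python
import json

def synth_step(llm, task, history, action):
    """Ask the LLM environment model for the next observation, reward, and done flag."""
    prompt = (
        "You simulate an interactive environment. Reason step by step, then\n"
        'answer as JSON: {"observation": "...", "reward": 0 or 1, "done": true or false}\n'
        f"Task: {task}\nHistory: {json.dumps(history)}\nAgent action: {action}"
    )
    return json.loads(llm(prompt))

def propose_tasks(llm, seed_tasks, k=5):
    """Generate new task variations so the agent doesn't overfit a static task set."""
    prompt = f"Write {k} new variations (one per line) of these tasks: " + "; ".join(seed_tasks)
    return llm(prompt).splitlines()[:k]

def collect_synthetic_episode(llm, policy, task, replay_buffer, max_steps=8):
    """Roll out the policy against the simulated environment and store the
    trajectory in a replay buffer for the RL learner."""
    history = []
    for _ in range(max_steps):
        action = policy(task, history)
        step = synth_step(llm, task, history, action)
        history.append({"action": action, **step})
        if step["done"]:
            break
    replay_buffer.append({"task": task, "trajectory": history})
```

In a sim-2-real setup, the policy would first be trained on this synthetic experience and then fine-tuned in the real, high-cost environment.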
Oct 17 • 5 tweets • 3 min read
🌀Agent Learning via Early Experience🌀
📝: arxiv.org/abs/2510.08558
- SFT for agents is sparse; RL on long horizons is hard
We provide new mid-training signals that work:
1) Implicit next state world modeling task
2) Self-reflection on alternate states
- Strong improvements across 8 environments and multiple model families
- Works well for subsequent RL!
🧵1/5

Recipe 👨‍🍳:

1) Implicit world modeling
Augments expert trajectories with alternative actions and predicted next states, training the policy to internalize transition dynamics before deployment.

2) Self-reflection
Augments expert actions with self-generated explanations, training the policy to reason about and revise its own decisions.

Both methods use K alternative actions proposed by the initial policy (LLM).
🧵2/5
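A minimal sketch of how the two mid-training datasets might be built from expert trajectories, assuming `policy_sample` proposes the K alternative actions, `env_step` returns the observed next state, and `llm_reflect` produces the self-reflection text; the input/target formats are illustrative, not the paper's exact ones.

```python
def build_world_modeling_data(trajectories, policy_sample, env_step, k=3):
    """(1) Implicit world modeling: branch off each expert state with K
    alternative actions and use the observed next states as prediction targets."""
    data = []
    for traj in trajectories:                     # traj: list of (state, expert_action)
        for state, _expert_action in traj:
            for alt_action in policy_sample(state, k):
                next_state = env_step(state, alt_action)
                data.append({
                    "input": f"State: {state}\nAction: {alt_action}\nPredict the next state.",
                    "target": next_state,
                })
    return data

def build_self_reflection_data(trajectories, policy_sample, llm_reflect, k=3):
    """(2) Self-reflection: contrast the expert action with K alternatives and
    train on a self-generated explanation followed by the expert action."""
    data = []
    for traj in trajectories:
        for state, expert_action in traj:
            alts = policy_sample(state, k)
            explanation = llm_reflect(state, expert_action, alts)
            data.append({
                "input": f"State: {state}\nCandidate actions: {alts + [expert_action]}",
                "target": f"{explanation}\nChosen action: {expert_action}",
            })
    return data
```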
Oct 13 • 5 tweets • 3 min read
Hybrid Reinforcement (HERO): When Reward Is Sparse, It's Better to Be Dense 🦸‍♂️ 💪
📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward models → better results!

✔️ Stratified normalization anchors dense scores within verifier groups
✔️ Variance-aware weighting emphasizes harder, high-variance prompts
✔️ Stable + informative rewards, no drift

📈 Results:
🔥 +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
🔥 Generalizes across Qwen and OctoThinker models
🔥 Works well when training with easy-to-verify/hard-to-verify/mixed samples.

Hybrid reward → stable, dense, reliable supervision, advancing reasoning RL

🧵(1/5)

Motivation & analysis 🎯:

(a) Rule-based Reward: precise but brittle; gives 0 to many almost-correct answers.
✅ correctness but too strict.

(b) Reward Model: smooth but misaligned; sometimes rewards wrong answers (false positives) or underrates correct ones (false negatives).
✅ coverage but too loose.

Neither sparse nor dense alone is enough (see table).

🧵(2/5)
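A rough sketch of a hybrid reward in the spirit of HERO; the exact normalization and weighting in the paper may differ, and `alpha`, `eps`, and the weight range are assumed knobs. Per prompt, each rollout has a 0/1 verifier label and a dense reward-model score; dense scores are min-max normalized within each verifier group so they only re-rank responses sharing the same verifier outcome, and prompts whose rollouts disagree get a larger weight.

```python
import numpy as np

def hybrid_rewards(verifier, rm_scores, alpha=0.5, eps=1e-6):
    """Stratified normalization (sketch): anchor dense RM scores inside each
    verifier group so a verified answer always outranks an unverified one."""
    verifier = np.asarray(verifier, dtype=float)    # 0/1 verifier labels per rollout
    rm_scores = np.asarray(rm_scores, dtype=float)  # dense reward-model scores
    rewards = np.empty_like(rm_scores)
    for v in (0.0, 1.0):
        mask = verifier == v
        if mask.any():
            group = rm_scores[mask]
            # min-max normalize within the group: dense scores only re-rank
            # responses that share the same verifier outcome
            norm = (group - group.min()) / (group.max() - group.min() + eps)
            rewards[mask] = v + alpha * (norm - 0.5)  # centered around 0 or 1
    return rewards

def prompt_weight(verifier, w_min=0.5, w_max=1.0):
    """Variance-aware weighting (sketch): upweight prompts whose rollouts
    disagree, i.e., the harder, more informative ones."""
    var = np.var(np.asarray(verifier, dtype=float))   # Bernoulli variance, max 0.25
    return w_min + (w_max - w_min) * (var / 0.25)
```

With alpha = 0.5, verified rollouts fall in [0.75, 1.25] and unverified ones in [-0.25, 0.25], so the dense model only breaks ties inside each group rather than overriding the verifier.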
Jun 30 • 4 tweets • 2 min read
🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with a sync every s steps (more efficient!) also works very well.
- Offline DPO is way behind.
- Combining verifiable + non-verifiable works! Cross-transfer gains.
- Recipes for how to make this work.
🧵1/4

- Online DPO results in a 59.4% increase in AlpacaEval LC winrate & 56.2% in ArenaHard score compared to standard DPO, which is poor due to its offline nature.
- Online DPO achieves comparable performance to online GRPO.
- But more surprisingly so does semi-online DPO.
🧵2/4
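The semi-online schedule can be sketched as follows (assumptions, not the released training code): responses are sampled from a frozen snapshot of the policy, preferences come from a verifier or reward model via `annotate`, `dpo_update` performs one DPO step, and the snapshot is re-synced every s steps; `policy.clone()` and `gen_model.sample()` are assumed model interfaces.

```python
def semi_online_dpo(policy, ref_model, prompt_batches, annotate, dpo_update, s=32):
    """Sketch of semi-online (iterative) DPO.
    prompt_batches: iterable of prompt batches.
    annotate(batch, responses): returns (chosen, rejected) preference pairs,
        from a verifier on verifiable tasks or a reward model otherwise.
    dpo_update(policy, ref_model, chosen, rejected): one DPO gradient step."""
    gen_model = policy.clone()                # frozen snapshot used for sampling
    for step, batch in enumerate(prompt_batches, start=1):
        responses = gen_model.sample(batch)   # sampled from the snapshot,
                                              # not the latest policy weights
        chosen, rejected = annotate(batch, responses)
        dpo_update(policy, ref_model, chosen, rejected)
        if step % s == 0:                     # s = 1 recovers fully online DPO;
            gen_model = policy.clone()        # never syncing recovers offline DPO
    return policy
```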
Jul 30, 2024 • 7 tweets • 3 min read
🚨New paper!🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs

🧵(1/6) arxiv.org/abs/2407.19594
Recipe 👩‍🍳:
Iterate 3 steps:
(1) Create Actor data: generate responses & self-rewards (judgments) with LM
(2) Create Judge data: generate meta-rewards over judgments with LLM-as-a-Meta-Judge
(3) Train DPO on preference pairs to both learn to act (1) AND to judge (2)
🧵(2/6)
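One iteration of the recipe might look like the sketch below; the pairing rules and scoring details only loosely follow the paper. A single `model` plays actor, judge, and meta-judge, with `model.generate`, `model.judge`, `model.meta_judge`, and `dpo_train` as assumed interfaces (`judge` returning a (score, rationale) pair).

```python
def meta_rewarding_iteration(model, prompts, dpo_train, n_responses=4):
    """One iteration (sketch): build actor and judge preference pairs, then
    run DPO on both so the model learns to act AND to judge."""
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        # (1) Actor data: sample responses and self-judge each one
        responses = [model.generate(prompt) for _ in range(n_responses)]
        judgments = [model.judge(prompt, r) for r in responses]   # (score, rationale)
        ranked = sorted(zip(judgments, responses), key=lambda x: x[0][0])
        actor_pairs.append((prompt, ranked[-1][1], ranked[0][1]))  # best vs worst response

        # (2) Judge data: meta-judge two alternative judgments of the same response
        r = responses[0]
        j_a, j_b = model.judge(prompt, r), model.judge(prompt, r)
        winner = model.meta_judge(prompt, r, j_a, j_b)   # which judgment is better?
        loser = j_b if winner is j_a else j_a
        judge_pairs.append(((prompt, r), winner, loser))

    # (3) Preference optimization on both kinds of pairs
    dpo_train(model, actor_pairs + judge_pairs)
    return model
```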
Aug 14, 2023 • 4 tweets • 2 min read
🚨New Paper 🚨
Self-Alignment with Instruction Backtranslation

- New method auto-labels web text with instructions & curates high-quality ones for FTing

- Our model Humpback 🐋 outperforms LIMA, Claude, Guanaco, davinci-003 & Falcon-Inst


(1/4)🧵 arxiv.org/abs/2308.06259
Recipe 👩‍🍳: LLM finetuned on small seed data; access to web docs
(1) Self-augment: label each web doc with an instruction via the LLM
(2) Self-curate: label each new example with a quality score via the LLM
Then FT with the newly curated data.
Optionally Iterate.

(2/4) 🧵
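A minimal sketch of the two self-labeling steps, assuming `seed_model` is the LLM finetuned on the small seed set and `web_docs` is a list of unlabeled web passages; the prompts and the keep-only-top-scores threshold are illustrative.

```python
def self_augment(seed_model, web_docs):
    """Step (1): predict the instruction each web doc would be a good answer to."""
    pairs = []
    for doc in web_docs:
        instruction = seed_model.generate(
            "Write the instruction that this text is a good answer to:\n" + doc
        )
        pairs.append({"instruction": instruction, "response": doc})
    return pairs

def self_curate(seed_model, pairs, threshold=5):
    """Step (2): score each candidate pair with the same LLM and keep the best."""
    curated = []
    for ex in pairs:
        reply = seed_model.generate(
            "Rate from 1 to 5 how well the response answers the instruction. "
            "Answer with a single digit.\n"
            f"Instruction: {ex['instruction']}\nResponse: {ex['response']}\nScore:"
        )
        if int(reply.strip()[0]) >= threshold:   # assumes the reply starts with a digit
            curated.append(ex)
    return curated

# Then finetune on the seed data plus the curated pairs, and optionally
# iterate with the improved model as the new labeler and curator.
```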