🤔 Why is RL for language agents still so hard?
Recipe 👨‍🍳:
Motivation & analysis 🎯:
- Online DPO yields a 59.4% increase in AlpacaEval LC win rate and a 56.2% increase in ArenaHard score compared to standard DPO. Standard DPO underperforms because it is offline.
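
For reference, a minimal sketch of the per-pair DPO loss that both variants optimize. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not taken from the post. The offline/online difference is only in where the pair comes from: a fixed preference dataset (offline) versus responses sampled from the current policy and then ranked (online).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. beta=0.1 is an assumed value for illustration.
    """
    # How much more the policy prefers chosen over rejected,
    # relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities for one (chosen, rejected) pair.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

When the policy and reference agree on both responses the margin is zero and the loss is log 2; it drops as the policy favors the chosen response more strongly than the reference does.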
Recipe 👩‍🍳:
Recipe 👩‍🍳: LLM finetuned on small seed data; access to web docs