I added a basic implementation of DeepSeek's GRM/SPCT paper to the debate framework - just many rounds of principles/critiques for the scoring.
Similar early win rate vs GPT-4o-mini, and anecdotally the arguments read much better and feel less reward-hacky to me. GitHub below.
GitHub:
This code is very much a work in progress - it's pretty hard-coded for the debate framework rn. github.com/brendanhogan/D…
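Roughly what one scoring call looks like - a minimal sketch of the principles/critiques loop, assuming a placeholder `llm_generate(prompt)` helper for the judge call; the prompts, round count, and score parsing are illustrative, not the repo's actual code:

```python
# Minimal sketch of GRM/SPCT-style scoring for one debate pairing.
# llm_generate(prompt) -> str is a placeholder for the judge-model call;
# prompt wording and parsing are assumptions, not the repo's exact code.
import re
import statistics

def grm_score(topic: str, argument_a: str, argument_b: str,
              llm_generate, n_rounds: int = 4) -> tuple[float, float]:
    """Run several principle/critique rounds and average the scores."""
    scores_a, scores_b = [], []
    for _ in range(n_rounds):
        # 1) the judge writes its own evaluation principles for this topic
        principles = llm_generate(
            f"Debate topic: {topic}\n"
            "Write 3 short principles for judging which argument is stronger."
        )
        # 2) it critiques both arguments against those principles and emits scores
        critique = llm_generate(
            f"Principles:\n{principles}\n\n"
            f"Argument A:\n{argument_a}\n\nArgument B:\n{argument_b}\n\n"
            "Critique each argument against the principles, then end with a line "
            "'Scores: A=<1-10>, B=<1-10>'."
        )
        match = re.search(r"A\s*=\s*(\d+).*?B\s*=\s*(\d+)", critique, re.S)
        if match:
            scores_a.append(float(match.group(1)))
            scores_b.append(float(match.group(2)))
    if not scores_a:          # parsing failed every round
        return 5.0, 5.0       # neutral fallback
    # averaging over rounds gives a softer, harder-to-hack reward signal
    return statistics.mean(scores_a), statistics.mean(scores_b)
```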
Apr 1
Comedy has been achieved internally.
Qwen-2.5-7B is able to get the following win rate for jokes vs GPT-4o-mini.
The prompt is roughly 'generate a larry david style rant on {subject}' and the judge determines who is funnier - more details and examples in the comments.
Code available here with the dataset: I do think it's interesting that a 1.5B model couldn't ever win - whether the judge was itself or GPT-4o-mini. github.com/brendanhogan/D…
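For context, the head-to-head eval is roughly this shape - a minimal sketch where `policy_generate`, `gpt4o_mini_generate`, and `judge_generate` are hypothetical stand-ins for the actual model calls, and the prompt wording is approximate:

```python
# Minimal sketch of the head-to-head joke evaluation.
# Helper names and prompts are assumptions, not the repo's exact code.
def joke_win_rate(subjects, policy_generate, gpt4o_mini_generate, judge_generate) -> float:
    wins = 0
    for subject in subjects:
        prompt = f"generate a larry david style rant on {subject}"
        rant_a = policy_generate(prompt)        # trained Qwen-2.5-7B
        rant_b = gpt4o_mini_generate(prompt)    # GPT-4o-mini baseline
        verdict = judge_generate(
            f"Rant A:\n{rant_a}\n\nRant B:\n{rant_b}\n\n"
            "Which rant is funnier? Answer with exactly 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("A"):
            wins += 1
    return wins / len(subjects)
```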
Mar 28
New project: teaching LLMs to debate through self-play!
Using R1-style GRPO with LLM-judged round-robin tournaments, Qwen 2.5-1.5B learns to improve its arguments - going from winning 3% to 95% of debates against GPT-4o-mini. No hand-crafted rewards, just models learning from each other - code and more info below 🤖
How it works: during training, the model generates multiple debate responses on the same topic. A judge LLM (the base Qwen 2.5-1.5B model) evaluates these against each other in a round-robin tournament, creating soft rewards that help the model learn which arguments work better - rough sketch below. GitHub code (new branch): github.com/brendanhogan/D…
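A minimal sketch of the round-robin soft-reward step - the `judge_generate` helper and prompt wording are placeholders, not the repo's exact interface:

```python
# Minimal sketch of round-robin soft rewards feeding GRPO.
# judge_generate(prompt) -> str is a placeholder for the judge-model call.
import itertools

def round_robin_rewards(topic: str, completions: list[str], judge_generate) -> list[float]:
    """Every completion debates every other; win fraction becomes its reward."""
    wins = [0] * len(completions)
    for i, j in itertools.combinations(range(len(completions)), 2):
        verdict = judge_generate(
            f"Debate topic: {topic}\n\n"
            f"Argument 1:\n{completions[i]}\n\nArgument 2:\n{completions[j]}\n\n"
            "Which argument is more persuasive? Answer '1' or '2'."
        )
        if verdict.strip().startswith("1"):
            wins[i] += 1
        else:
            wins[j] += 1
    n_matches = len(completions) - 1  # games each completion plays
    # soft rewards in [0, 1]; GRPO then normalizes these within the group
    return [w / n_matches for w in wins]
```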