Brendan Hogan
ml research scientist @morganstanley || phd in cs @cornell 2024
Dec 18, 2025
🎄 Advent of Small ML: Day 18 🎄 Topic: GRPO Training with 1 Million Persona Judges (Optimizing for Your Audience)

yesterday i showed how we can simulate 1M personas to "poll" the country. today i wanted to close the loop: what if we use those personas as the judge in a GRPO training loop?

the idea is simple: instead of training a model for generic "quality" (which usually just means "what an RLHF rater likes"), we can train it to specifically resonate with a targeted slice of the population.

so i took the simulation engine from yesterday and turned it into a reward function.

the model generates 4 tweets about "The Future of Work"

A jury of 50 personas (filtered to a specific demographic) votes in a round-robin tournament

Win rate = Reward Signal for GRPO
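
the reward plumbing is simple enough to sketch. here's a rough version - judge_vote is a hypothetical helper that prompts a persona-conditioned LLM with two tweets and returns which one that persona prefers; names and signatures are mine, not the repo's:

```python
import itertools
from typing import Callable, Sequence

def round_robin_win_rates(
    completions: Sequence[str],
    jury: Sequence[dict],
    judge_vote: Callable[[dict, str, str], int],
) -> list[float]:
    """Pit every pair of completions against each other; each persona in the
    jury votes for the tweet it prefers. A completion's reward is its overall
    win rate across all of its matchups."""
    wins = [0] * len(completions)
    total = [0] * len(completions)
    for i, j in itertools.combinations(range(len(completions)), 2):
        for persona in jury:
            # judge_vote returns 0 if the persona prefers completions[i], 1 otherwise
            winner = i if judge_vote(persona, completions[i], completions[j]) == 0 else j
            wins[winner] += 1
            total[i] += 1
            total[j] += 1
    return [w / t if t else 0.0 for w, t in zip(wins, total)]

# usage: rewards = round_robin_win_rates(group_of_4_tweets, jury_of_50, judge_vote)
```

GRPO then normalizes these win rates within the group of 4, so a tweet only gets positive advantage if it out-polls its siblings.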

for this run, i set the target demographic to "Young Professionals (18-29) in Coastal Cities (NY, CA)".

the result? you can watch the model learn to optimize its messaging for that demographic

it started out losing to GPT-4.1, but after ~150 steps of GRPO it learned the specific tone/framing that group likes, hitting a 62% win rate against GPT-4.1 within that demographic

i updated the dashboard from yesterday so you can visualize the training run (video and explanation below)

you can scrub through the training steps and watch the map turn "blue" (meaning our model wins) specifically in the target states

it’s a cool proof of concept for "Demographic Alignment" - optimizing models not just for "humans" broadly but for specific communities, by using a specific demographic as the judge you optimize for

code + demo in comments

code: github.com/brendanhogan/2…
Dec 16, 2025
🎄 Advent of Small ML: Day 16 🎄 Topic: ENGRAM (Skill → Cartridge) for Wiki Search (Continual Learning for a multi-turn tool use environment)

huge thank you to @willccbb and @PrimeIntellect for building the wiki environment, verifiers and the environments hub - it makes it super easy to try out all kinds of ideas like this in a controllable, repeatable and measurable way!

Environment:

the environment works like this: the LLM is presented with a trivia question that can be answered from a wikipedia page, along with a corpus of wikipedia pages (and their embeddings in a ChromaDB database)

the llm has three tools - search_pages, view_sections, read_section. It has to learn strategies: when to search broadly vs. specifically, how to navigate page structure, and when to stop - so as to best answer its question

the LLM's answer is then scored using llm-as-a-judge
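
for intuition, here's roughly how i picture the three tools sitting on top of the ChromaDB index - the PARSED_PAGES dict and the metadata layout are placeholders of mine, the real tool schemas live in the verifiers wiki environment:

```python
import chromadb

client = chromadb.Client()
pages = client.get_or_create_collection("wiki_pages")  # assumes pages were added with metadatas={"title": ...}

# placeholder: title -> {section header -> section text}, filled when the corpus is parsed
PARSED_PAGES: dict[str, dict[str, str]] = {}

def search_pages(query: str, k: int = 5) -> list[str]:
    """Embedding search over the page collection; returns candidate page titles."""
    res = pages.query(query_texts=[query], n_results=k)
    return [m["title"] for m in res["metadatas"][0]]

def view_sections(title: str) -> list[str]:
    """List the section headers of one page, so the model can decide what to read."""
    return list(PARSED_PAGES[title].keys())

def read_section(title: str, section: str) -> str:
    """Return the raw text of one section for the model to read."""
    return PARSED_PAGES[title][section]
```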

Method:

ENGRAM: I use the same "Conscious Practice → Muscle Memory" loop:

Phase A (Skill): The agent tries to solve questions. I use the Prime Intellect verifiers library to judge the answers (GPT-4.1). Based on feedback, I then update a text-based "Strategy Guide."

Phase B (Cartridge): Every N steps, i distill that text guide into a compressed Cartridge (KV cache vectors).

Phase C: Reset the guide, keep the cartridge.
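
to make phase B concrete, here's a minimal sketch of the consumption side, using a plain (uncompressed) KV prefix via HF transformers as a stand-in for the distilled cartridge - the actual distillation/compression step is more involved and lives in the repo:

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def build_cartridge(strategy_guide: str) -> DynamicCache:
    """Prefill the text strategy guide once and keep its KV cache as the 'cartridge'."""
    ids = tok(strategy_guide, return_tensors="pt").to(model.device)
    return model(**ids, past_key_values=DynamicCache(), use_cache=True).past_key_values

@torch.no_grad()
def answer_with_cartridge(strategy_guide: str, question: str, cartridge: DynamicCache) -> str:
    """Generate an answer; tokens already covered by the cache are not recomputed.
    Note: the full text must tokenize to the same prefix the cache was built from."""
    full = tok(strategy_guide + "\n\n" + question, return_tensors="pt").to(model.device)
    out = model.generate(**full, past_key_values=copy.deepcopy(cartridge), max_new_tokens=128)
    return tok.decode(out[0, full["input_ids"].shape[1]:], skip_special_tokens=True)
```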

Results:

On a small test set, the model started at 20% accuracy (it didn't know how to use the tools effectively). After the skill refinement and cartridge distillation loop, it peaked at 40% accuracy (full results below)

definitely a small test - but it successfully encoded "search strategies" into a compressed vector format that persists without fine-tuning.

repo + results + skill example below

code: github.com/brendanhogan/2…

wiki environment: app.primeintellect.ai/dashboard/envi…
Dec 7, 2025
🎄 Advent of Small ML: Day 7 🎄 Topic: Entropy-Based Rewards (Forcing the model to "keep its options open")

there’s a fascinating recent paper (Layer by Layer: Uncovering Hidden Representations in Language Models - arxiv.org/abs/2502.02013 - shown to me by @aditjain1980) showing that reasoning models tend to have higher entropy in their middle layers

basically, instead of collapsing to an answer early, they keep more possibilities "alive" in their hidden states while thinking.

it made me think - if high entropy correlates with better reasoning, can we force the model to reason better by explicitly rewarding high entropy?

so I added a matrix-based entropy reward (Rényi entropy over the eigenvalues of the hidden-state Gram matrix) to GRPO training on the MATH500 dataset, rewarding the entropy of the middle 10 layers of qwen 2.5 7b
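
for reference, matrix-based Rényi entropy is usually computed from the eigenvalues of the trace-normalized Gram matrix of a layer's hidden states - something like the sketch below. the exact alpha, normalization, and layer weighting i used live in the repo; treat this as the general recipe, not the exact reward:

```python
import torch

def matrix_renyi_entropy(hidden: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Matrix-based Renyi entropy of one layer's hidden states.

    hidden: (seq_len, d) activations for a single sequence at a single layer.
    Builds the Gram matrix, normalizes its eigenvalues to sum to 1, and computes
    S_alpha = 1/(1-alpha) * log(sum_i lambda_i^alpha)."""
    h = hidden.float()
    gram = h @ h.T                                        # (seq_len, seq_len)
    gram = gram / gram.diagonal().sum().clamp_min(1e-8)   # trace-normalize
    eigvals = torch.linalg.eigvalsh(gram).clamp_min(1e-12)
    if abs(alpha - 1.0) < 1e-6:                           # von Neumann / Shannon limit
        return -(eigvals * eigvals.log()).sum()
    return eigvals.pow(alpha).sum().log() / (1.0 - alpha)

def middle_layer_entropy(hidden_states: tuple, alpha: float = 2.0) -> float:
    """Average entropy over the middle 10 layers (hidden_states as returned by a
    HF forward pass with output_hidden_states=True, batch size 1)."""
    n = len(hidden_states)
    mid = range(n // 2 - 5, n // 2 + 5)
    vals = [matrix_renyi_entropy(hidden_states[i][0], alpha) for i in mid]
    return torch.stack(vals).mean().item()
```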

the initial results were mixed.

when I just rewarded entropy, the model definitely increased its entropy... but it didn't get better at math. It just learned to be "confused" and exploratory without actually converging on answers.

It produced some pretty funny outputs, going on weird tangents and "overthinking" simple problems (examples below)

But then I changed the reward rule: only reward high entropy if the final answer is CORRECT.
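
the gate itself is tiny - something like this, with the base/coef weights as placeholders:

```python
def gated_entropy_reward(answer_correct: bool, entropy: float,
                         base: float = 1.0, coef: float = 0.1) -> float:
    """Only pay the entropy bonus when the final answer is right; wrong answers
    get nothing, so the model can't farm reward by just being 'confused'."""
    if not answer_correct:
        return 0.0
    return base + coef * entropy
```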

this worked (sort of) - it gave a 2.5% performance boost over the baseline.

this is a proof of concept that we can use RL to shape the internal dynamics of how a model thinks, not just its final output tokens.

Repo + Plots below

Code: github.com/brendanhogan/2…
Dec 3, 2025
🎄 Advent of Small ML: Day 3 Topic: Adversarial Unsupervised GRPO (Automated Red Teaming) 🎄

yesterday, I showed how to train a VLM without labels using a CycleGAN-ish loop. today I wanted to expand on that and make it harder/better

instead of training on random images, can we have an active adversary that hunts for the model's blind spots?

the hypothesis: if we train the model against an adversary that generates "hard" images, the model should become more robust and generalize better than just seeing random data.

the experiment: I set up a competitive game (gan-style) between two models:

the base model: tries to describe images so they can be recreated (reward = high cosine similarity) (same as yesterday)

the adversary: tries to generate prompts for images that the base model fails to describe well (reward = low cosine similarity).

basically, the adversary acts as an automated red team, constantly searching for the base model's weaknesses.
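
reward-wise, the two players see the same cosine similarity scores with opposite objectives - roughly:

```python
def zero_sum_rewards(cos_sims: list[float]) -> tuple[list[float], list[float]]:
    """Same DINO cosine similarities, opposite objectives: the describer wants
    them high, the adversary wants them low. GRPO normalizes each list within
    its own group before computing advantages."""
    describer_rewards = list(cos_sims)
    adversary_rewards = [1.0 - s for s in cos_sims]
    return describer_rewards, adversary_rewards
```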

the adversarially trained model actually beat the non-adversarial baseline from yesterday in the early stages, though the two eventually converged to similar levels.

Repo + Plots + more results below

repo: github.com/brendanhogan/2…
Dec 2, 2025
🎄 Advent of Small ML: Day 2: Teaching a VLM to reason about charts with Unsupervised GRPO 🎄

a big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: Can we do this in an unsupervised way?

If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.

The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b’s SVG work - go check it out).

The Experiment: I set up a loop to treat the VLM like an autoencoder:

1. Take a chart image.

2. Prompt the VLM to describe it.

3. Feed that description into an image generator (Flux Schnell).

4. Measure the cosine similarity between the regenerated image and the original (using DINO)

This similarity score becomes the reward signal for GRPO. The logic: to accurately recreate the image, the model must extract the most salient features in its description.

The methods: I used Qwen 2.5 3B and DINOv2 for the embeddings (to capture semantic info, not just pixels).
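
the reward function is basically "describe, regenerate, compare". a minimal sketch of how i'd wire it up - the exact Flux/DINOv2 checkpoints and preprocessing here are my assumptions, not necessarily what the repo uses:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from diffusers import FluxPipeline

flux = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell",
                                    torch_dtype=torch.bfloat16).to("cuda")
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to("cuda").eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)),   # 224 is divisible by DINOv2's patch size of 14
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(pil_image) -> torch.Tensor:
    """DINOv2 embedding of a PIL image."""
    return dino(prep(pil_image.convert("RGB")).unsqueeze(0).to("cuda"))

@torch.no_grad()
def cycle_reward(original_chart, description: str) -> float:
    """Reward = DINOv2 cosine similarity between the original chart and the
    image Flux regenerates from the VLM's description."""
    regen = flux(description, num_inference_steps=4, guidance_scale=0.0).images[0]
    return F.cosine_similarity(embed(original_chart), embed(regen)).item()
```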

Results for the Proxy Task: The model consistently improved its cosine similarity scores.

Results for Transfer Learning: Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing a ~7% improvement in pass@1 at the peak.

It’s a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.

I’m really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.

Repo + Plots in the comments.

Github: github.com/brendanhogan/2…
Aug 24, 2025
just pushed my first multi-turn RL environment to @PrimeIntellect

the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).

its only tool: agentic RAG search over the story.

this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
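
the search tool itself is just top-k retrieval over chunks of the story. a rough sketch (the chunking scheme and embedding model are my choices, not the environment's):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def build_index(story: str, chunk_size: int = 300):
    """Split the story into word-count chunks and embed each one."""
    words = story.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vecs

def search_story(query: str, chunks, vecs, k: int = 3) -> list[str]:
    """The single tool: return the k story chunks most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(vecs @ q))[:k]
    return [chunks[i] for i in top]
```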
Aug 13, 2025
introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q

All details below!
Links:

Technical Report: arxiv.org/abs/2508.06813

Models + Data on HuggingFace: huggingface.co/collections/mo…

Full Code: github.com/morganstanley/…
Jul 11, 2025
doing this now for my debate framework: gpt4.1 vs gpt4.1 advised by qwen 3B

gpt4.1 with qwen's advice debates itself in elo/tournament style to compute an advantage

that advantage is used to GRPO qwen to give better advice

you can fine-tune api models with RL'd context

code: github.com/brendanhogan/D…
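
collapsing the elo/tournament scoring down to a single win/loss, one GRPO group looks roughly like this - every callable here is a placeholder for the real pipeline in the repo:

```python
from typing import Callable

def advisor_step(
    topic: str,
    advise: Callable[[str], str],       # qwen 3B policy: topic -> advice text
    debate: Callable[[str, str], str],  # gpt-4.1 debates itself, one side seeing the advice -> transcript
    judge: Callable[[str], bool],       # did the advised side win?
    group_size: int = 8,
) -> tuple[list[str], list[float]]:
    """One GRPO group: sample several pieces of advice, score each by whether
    the advised copy of gpt-4.1 wins its debate, return (completions, rewards).
    GRPO normalizes the rewards within the group and updates only qwen."""
    advices = [advise(topic) for _ in range(group_size)]
    rewards = [1.0 if judge(debate(topic, a)) else 0.0 for a in advices]
    return advices, rewards
```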
Jul 3, 2025
other idea - if you assume it’s an open-weights model, can you learn an embedding-space context/prompt that improves performance?

I use/train a simple 3-layer network: it maps the last embedding of the prompt to a new embedding, which is then fed into the frozen LLM

code: github.com/brendanhogan/D…
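
the network itself is tiny. a sketch of how i'd write it - whether the learned embedding is appended or prepended, and how many of them there are, is a detail of the repo, not this sketch:

```python
import torch
import torch.nn as nn

class SoftPromptPredictor(nn.Module):
    """3-layer MLP: maps the last prompt token's embedding to one extra
    'virtual token' embedding that is fed into the frozen LLM."""
    def __init__(self, d_model: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq, d_model) from the frozen model's embedding table
        virtual = self.net(prompt_embeds[:, -1, :]).unsqueeze(1)   # (batch, 1, d_model)
        return torch.cat([prompt_embeds, virtual], dim=1)          # extra learned "token" at the end

# usage with a frozen HF model (names illustrative, attention mask needs one extra position):
#   embeds = model.get_input_embeddings()(input_ids)
#   out = model(inputs_embeds=predictor(embeds), attention_mask=extended_mask)
```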
May 23, 2025
introducing: picoDeepResearch

multi-turn tool use + soft rewards + self-play + GRPO

You define the arena (report prompts + judging principles)

the model generates reports, uses tools (web search), then competes in round-robin battles judged by an LLM

winner gets the gradient
Code:

all still just pytorch, no vLLM/TRL/etc

inspired by OpenAI’s Deep Research, but made “pico”, just enough to run real experiments, fine-tune real models, and build intuition

these results were using Qwen3-14B

github.com/brendanhogan/p…
May 15, 2025
new project - training a VLM to solve CAPTCHAs with RL (GRPO on F1 score).

introduced a “tool” for click_screen(x, y).

dataset is from Cityscapes, F1 goes from 0.11 to ~0.78. details below
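
the reward is just set-level F1 over which grid cells got clicked. a sketch, with the click parsing and the pixel→cell mapping as my own assumptions:

```python
import re

def parse_clicks(response: str) -> set[tuple[int, int]]:
    """Pull every click_screen(x, y) call out of the model's response."""
    return {(int(x), int(y)) for x, y in re.findall(r"click_screen\((\d+),\s*(\d+)\)", response)}

def to_cell(x: int, y: int, cell_px: int = 112) -> tuple[int, int]:
    """Map a pixel click onto the CAPTCHA grid cell it lands in (cell size is a placeholder)."""
    return (x // cell_px, y // cell_px)

def click_f1(response: str, target_cells: set[tuple[int, int]]) -> float:
    """F1 between the grid cells the model clicked and the cells containing the target object."""
    clicked = {to_cell(x, y) for x, y in parse_clicks(response)}
    if not clicked or not target_cells:
        return 0.0
    tp = len(clicked & target_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(clicked)
    recall = tp / len(target_cells)
    return 2 * precision * recall / (precision + recall)
```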

code: github.com/brendanhogan/D…
Apr 27, 2025
i added a basic implementation of deepseek’s grm/spct paper to the debate framework - just many rounds of principles/critiques for the scoring
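
roughly, the judge writes its own principles each round, critiques both sides against them, and the per-round scores get aggregated - something like this sketch, where judge is any text-in/text-out LLM call (the prompts and aggregation here are mine, not deepseek's exact recipe):

```python
import statistics
from typing import Callable

def grm_score(
    argument_a: str,
    argument_b: str,
    topic: str,
    judge: Callable[[str], str],   # any text-in/text-out LLM call
    rounds: int = 4,
) -> float:
    """Rough GRM/SPCT-style scoring: each round the judge writes principles,
    critiques both arguments against them, and emits a 0-10 score for A;
    scores are averaged across rounds."""
    scores = []
    for _ in range(rounds):
        principles = judge(f"Write 3 principles for judging a debate on: {topic}")
        critique = judge(
            f"Principles:\n{principles}\n\nArgument A:\n{argument_a}\n\nArgument B:\n{argument_b}\n\n"
            "Critique both arguments against each principle, then on the last line "
            "output only a score from 0 to 10 for Argument A."
        )
        try:
            scores.append(float(critique.strip().splitlines()[-1]))
        except ValueError:
            continue  # unparseable round, skip it
    return statistics.mean(scores) if scores else 5.0
```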

similar early win rate vs gpt-4o-mini. and anecdotally, the arguments read much better and are less reward-hacky to me. gh below
Github:

this code is very much a work in progress - it's pretty hard-coded for the debate framework rn

github.com/brendanhogan/D…
Apr 1, 2025
comedy has been achieved internally
qwen-2.5-7B is able to get the following win rate for jokes vs gpt-4o-mini
the prompt is roughly ‘generate a larry david style rant on {subject}’ and the judge determines who is funnier - more details and examples in comments

code available here with the dataset: i do think it's interesting that a 1.5B model couldn't ever win - whether the judge was itself or gpt-4o-mini

github.com/brendanhogan/D…
Mar 28, 2025
new project: teaching LLMs to debate through self-play!
Using R1-style GRPO with LLM-judged round-robin tournaments, qwen 2.5-1.5B learns to improve its arguments - going from winning 3% to 95% of debates against gpt-4o-mini. No hand-crafted rewards, just models learning from each other - code and more info below 🤖

how it works: During training, the model generates multiple debate responses on the same topic. A judge LLM (the base qwen2.5-1.5B model) evaluates these against each other in a round-robin tournament, creating soft rewards that help the model learn which arguments work better.

Github Code (new branch): github.com/brendanhogan/D…