Brendan Hogan
AI/ML Research @morganstanley || PhD in CS @cornell 2024 || Abingdon Elementary 2005
May 15
new project - training a vision LLM to solve CAPTCHAs with rl (grpo on F1 score).

introduced a “tool” for click_screen(x, y).

dataset is from cityscapes, F1 goes from 0.11 to ~0.78. details below

code: github.com/brendanhogan/D…
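
a minimal sketch of what an F1-based click reward could look like - the tile grid size, the click_screen(x, y) parsing, and the ground-truth format here are assumptions for illustration, not necessarily how the repo does it:

```python
# sketch: F1 reward over clicked CAPTCHA tiles (grid size and formats assumed)
import re

def clicked_tiles(response: str, tile_w: int = 100, tile_h: int = 100) -> set:
    """Parse click_screen(x, y) tool calls and map each click to a grid tile."""
    calls = re.findall(r"click_screen\((\d+),\s*(\d+)\)", response)
    return {(int(x) // tile_w, int(y) // tile_h) for x, y in calls}

def f1_reward(response: str, target_tiles: set) -> float:
    """F1 between predicted tiles and ground-truth tiles, used as the GRPO reward."""
    pred = clicked_tiles(response)
    if not pred or not target_tiles:
        return 0.0
    tp = len(pred & target_tiles)
    precision = tp / len(pred)
    recall = tp / len(target_tiles)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```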
Apr 27
i added a basic implementation of deepseek’s grm/spct paper to the debate framework - just many rounds of principles/critiques for the scoring

similar early win rate vs gpt-4o-mini. and anecdotally, the arguments read much better and are less reward-hacky to me. github below
Github: github.com/brendanhogan/D…

this code is very much a work in progress - it's pretty hard-coded for the debate framework rn
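
a rough sketch of the "many rounds of principles/critiques" scoring idea - the llm() helper, the prompt wording, and the number of rounds are my assumptions, not the repo's actual implementation:

```python
# sketch: GRM/SPCT-style scoring - generate principles, critique, score, repeat
import re
from statistics import mean

def grm_style_score(llm, debate_topic: str, argument: str, rounds: int = 3) -> float:
    """Score one argument by repeatedly generating judging principles,
    critiquing the argument against them, and averaging the round scores."""
    scores = []
    for _ in range(rounds):
        principles = llm(
            f"List 3 principles for judging a debate argument on: {debate_topic}"
        )
        critique = llm(
            f"Principles:\n{principles}\n\nCritique this argument against them:\n{argument}"
        )
        verdict = llm(
            f"Critique:\n{critique}\n\nGive a final score from 1 to 10 as 'Score: N'."
        )
        match = re.search(r"Score:\s*(\d+)", verdict)
        if match:
            scores.append(int(match.group(1)))
    return mean(scores) if scores else 0.0
```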
Apr 1
comedy has been achieved internally
qwen-2.5-7B is able to get the win rate shown in the attached image for jokes vs gpt-4o-mini

the prompt is roughly 'generate a larry david style rant on {subject}' and the judge determines who is funnier - more details and examples in comments

code available here with the dataset: github.com/brendanhogan/D…

i do think it's interesting that a 1.5B model couldn't ever win - whether the judge was itself or gpt-4o-mini
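
a minimal sketch of the pairwise "who is funnier" judging - the prompt wording, the judge callable, and the answer parsing are placeholders i'm assuming for illustration:

```python
# sketch: head-to-head joke judging and win rate (judge() interface assumed)
def funnier(judge, subject: str, joke_a: str, joke_b: str) -> int:
    """Ask the judge which Larry David style rant is funnier; return 0 for A, 1 for B."""
    prompt = (
        f"Two comedians wrote a Larry David style rant on '{subject}'.\n\n"
        f"Rant A:\n{joke_a}\n\nRant B:\n{joke_b}\n\n"
        "Which is funnier? Answer with exactly 'A' or 'B'."
    )
    answer = judge(prompt).strip().upper()
    return 0 if answer.startswith("A") else 1

def win_rate(judge, subject: str, model_jokes: list, baseline_jokes: list) -> float:
    """Fraction of matchups the trained model's jokes win against the baseline."""
    wins = sum(
        funnier(judge, subject, ours, theirs) == 0
        for ours, theirs in zip(model_jokes, baseline_jokes)
    )
    return wins / len(model_jokes)
```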
Mar 28
new project: teaching LLMs to debate through self-play!
Using R1-style GRPO with LLM-judged round robin tournaments, qwen 2.5-1.5B learns to improve its arguments - going from winning 3% to 95% of debates against gpt-4o-mini. No hand-crafted rewards, just models learning from each other - code and more info below 🤖

how it works: During training, the model generates multiple debate responses on the same topic. A judge LLM (the base qwen2.5-1.5B model) evaluates these against each other in a round-robin tournament, creating soft rewards that help the model learn which arguments work better.

Github code (new branch): github.com/brendanhogan/D…
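
a rough sketch of how the round-robin soft rewards could be computed - the judge interface and the normalization are my assumptions, not the exact code in the repo:

```python
# sketch: round-robin tournament over a GRPO group, wins normalized to soft rewards
from itertools import combinations

def round_robin_rewards(judge, topic: str, responses: list) -> list:
    """Every response debates every other; the judge picks a winner per pair,
    and win counts are normalized into per-response soft rewards in [0, 1]."""
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        winner = judge(topic, responses[i], responses[j])  # returns 0 or 1
        wins[i if winner == 0 else j] += 1
    matches = max(len(responses) - 1, 1)  # games each response played
    return [w / matches for w in wins]
```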