Brendan Hogan
AI/ML Research @morganstanley || PhD in CS @cornell 2024 || Abingdon Elementary 2005 https://t.co/kLIMj2xI03
Aug 24 7 tweets 2 min read
just pushed my first multi-turn RL environment to @PrimeIntellect

the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).

its only tool: agentic RAG search over the story.

this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
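A minimal sketch of what an agentic search tool over a single story could look like — this is my illustration, not the environment's actual code: the story is chunked and the model's only tool returns the chunks that best match its query (here scored by simple keyword overlap as a stand-in for real retrieval).

```python
# Toy single-tool RAG search over a story. Chunk size, scoring, and the
# tool interface are all assumptions for illustration.

def chunk_story(story: str, chunk_size: int = 8) -> list[str]:
    """Split the story into fixed-size word chunks."""
    words = story.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def search(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the chunks with the highest keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:top_k]

story = "The fox lived near the river. Every morning it hunted for fish in the shallows."
chunks = chunk_story(story)
hits = search("where did the fox live", chunks, top_k=1)
```

In the real environment the model calls this tool over multiple turns before committing to a multiple-choice answer.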
Aug 13 26 tweets 7 min read
introducing qqWen: our fully open-sourced project (code+weights+data+detailed technical report) for full-stack finetuning (pretrain+SFT+RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q

All details below!
Links:

Technical Report: arxiv.org/abs/2508.06813

Models +Data on HuggingFace: huggingface.co/collections/mo…

Full Code: github.com/morganstanley/…
Jul 11 9 tweets 2 min read
doing this now for my debate framework: gpt4.1 vs gpt4.1 advised by qwen 3B

gpt4.1 w qwen's advice debates itself in elo/tournament style to get advantage

advantage is used to grpo qwen to give better advice

you can fine tune api models with rl'd context

code: github.com/brendanhogan/D…
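A hypothetical sketch of the loop described above — the small model writes advice, the advice is prepended to the API model's prompt, and the debate outcome is the signal used to RL the advisor. All callables and names here are illustrative stand-ins, not the repo's actual interfaces:

```python
# Advised-debate flow (toy version). advisor plays the role of qwen 3B,
# debater plays gpt-4.1, judge decides the winner; all are stand-ins.
def advised_debate(topic: str, advisor, debater, judge) -> int:
    advice = advisor(topic)                                   # small model's advice
    advised = debater(f"Advice: {advice}\nDebate: {topic}")   # API model + advice
    plain = debater(f"Debate: {topic}")                       # API model alone
    return judge(advised, plain)  # 1 if the advised side wins, else 0

win = advised_debate(
    "open weights help safety",
    advisor=lambda t: "cite concrete examples",
    debater=lambda p: f"argument({len(p)} chars)",
    judge=lambda a, b: 1 if len(a) >= len(b) else 0,
)
```

The win/loss signal is what gets turned into a GRPO advantage for the advisor — the API model's weights never change, only the context it receives.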
Jul 3 6 tweets 2 min read
other idea - if you assume it’s an open-weights model, can you learn an embedding-space context/prompt that improves performance?

I use/train a simple 3-layer network: it predicts from the last embedding of the prompt to a new embedding which is then fed into the frozen LLM

code: github.com/brendanhogan/D…
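A minimal PyTorch sketch of that idea, with assumed shapes and names: a 3-layer MLP maps the prompt's last hidden state to one extra embedding, which is appended to the frozen LLM's input embeddings.

```python
# Learned embedding-space context (illustrative; dimensions and the MLP
# width are assumptions, not the repo's exact configuration).
import torch
import torch.nn as nn

class ContextHead(nn.Module):
    """3-layer MLP: last prompt embedding -> one new input embedding."""
    def __init__(self, d_model: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, last_embed: torch.Tensor) -> torch.Tensor:
        return self.net(last_embed)

d_model = 64
head = ContextHead(d_model)
prompt_embeds = torch.randn(1, 10, d_model)          # frozen LLM's prompt embeddings
extra = head(prompt_embeds[:, -1, :]).unsqueeze(1)   # predicted context embedding
inputs = torch.cat([prompt_embeds, extra], dim=1)    # fed into the frozen LLM
```

Only the small head is trained; the LLM itself stays frozen, so the gradient flows through the appended embedding alone.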
May 23 9 tweets 2 min read
introducing: picoDeepResearch

multi-turn tool use + soft rewards + self-play + GRPO

You define the arena (report prompts + judging principles)

the model generates reports, uses tools (web search), then competes in round-robin battles judged by an LLM

winner gets the gradient
Code:

all still just pytorch, no vLLM/TRL/etc

inspired by OpenAI’s Deep Research, but made “pico”, just enough to run real experiments, fine-tune real models, and build intuition

these results were using qwen3-14B

github.com/brendanhogan/p…
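The round-robin arena above can be sketched like this — the judge is an LLM in the real project, replaced here by a stand-in callable; each report plays every other, and per-report win fractions become the soft rewards behind the GRPO update:

```python
# Round-robin soft rewards (toy version of the picoDeepResearch arena).
from itertools import combinations

def round_robin_rewards(reports: list[str], judge) -> list[float]:
    """judge(a, b) -> index of the winner (0 or 1). Returns win fractions."""
    wins = [0] * len(reports)
    for i, j in combinations(range(len(reports)), 2):
        winner = i if judge(reports[i], reports[j]) == 0 else j
        wins[winner] += 1
    n_games = len(reports) - 1  # games each report plays
    return [w / n_games for w in wins]

# toy judge: longer report wins (a real judge is an LLM with principles)
rewards = round_robin_rewards(
    ["a", "bb", "ccc"],
    lambda a, b: 0 if len(a) >= len(b) else 1,
)
```

The reward is soft — a report that wins half its games gets 0.5, not a binary label — which is what makes self-play with an LLM judge usable as a GRPO signal.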
May 15 6 tweets 2 min read
new project - training a VLM to solve CAPTCHAs with rl (grpo on F1 score).

introduced a “tool” for click_screen(x, y).

dataset is from cityscapes, F1 goes from 0.11 to ~0.78. details below
code: github.com/brendanhogan/D…
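A hedged sketch of the F1 reward over `click_screen(x, y)` calls — the box format and one-click-per-box matching rule here are my assumptions, not the repo's exact code:

```python
# F1 over predicted clicks vs. ground-truth target boxes: a click is a
# true positive if it lands inside a not-yet-matched box.
def click_f1(clicks, boxes):
    """clicks: [(x, y)]; boxes: [(x0, y0, x1, y1)] ground-truth targets."""
    matched = set()
    tp = 0
    for x, y in clicks:
        for i, (x0, y0, x1, y1) in enumerate(boxes):
            if i not in matched and x0 <= x <= x1 and y0 <= y <= y1:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(clicks) if clicks else 0.0
    recall = tp / len(boxes) if boxes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# two correct clicks out of two, but one of three boxes missed
score = click_f1([(5, 5), (50, 50)],
                 [(0, 0, 10, 10), (40, 40, 60, 60), (80, 80, 90, 90)])
```

Using F1 rather than raw accuracy penalizes both spurious clicks (precision) and missed targets (recall), which is the quantity GRPO is optimizing here.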
Apr 27 7 tweets 2 min read
i added a basic implementation of deepseek’s grm/spct paper to the debate framework - just many rounds of principles/critiques for the scoring

similar early win rate vs gpt-4o-mini. and anecdotally, the arguments read much better and are less reward hacky to me. gh below
Github:

this code is very much a work in progress - it's pretty hard coded for the debate framework rn

github.com/brendanhogan/D…
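The "many rounds of principles/critiques" loop can be sketched roughly like this — my reading of the GRM/SPCT-style scorer, with both judge-LLM calls replaced by stand-in callables:

```python
# GRM/SPCT-style scoring sketch: each round the judge first generates
# principles for the response, then critiques it against them and emits a
# score; scores are averaged across rounds. Interfaces are illustrative.
def grm_score(response: str, gen_principles, critique, n_rounds: int = 3) -> float:
    scores = []
    for _ in range(n_rounds):
        principles = gen_principles(response)          # e.g. ["clarity", "evidence"]
        scores.append(critique(response, principles))  # numeric score this round
    return sum(scores) / len(scores)

avg = grm_score(
    "some argument",
    gen_principles=lambda r: ["clarity", "evidence"],
    critique=lambda r, p: 0.6,
    n_rounds=3,
)
```

Generating the principles fresh each round, instead of fixing one rubric, is what seems to make the scoring harder to reward-hack.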
Apr 1 7 tweets 8 min read
comedy has been achieved internally
qwen-2.5-7B is able to get the following win rate for jokes vs gpt-4o-mini
the prompt is roughly ‘generate a larry david style rant on {subject}’ and the judge determines who is funnier - more details and examples in comments Image code available here with the dataset: i do think its interesting that a 1.5B model couldnt ever win - whether the judge was itself or gpt-4o-minigithub.com/brendanhogan/D…
Mar 28 8 tweets 3 min read
new project: teaching LLMs to debate through self-play!
Using R1-style GRPO with LLM-judged round-robin tournaments, qwen 2.5-1.5B learns to improve its arguments - going from winning 3% to 95% of debates against gpt-4o-mini. No hand-crafted rewards, just models learning from each other - code and more info below 🤖

how it works: during training, the model generates multiple debate responses on the same topic. A judge LLM (the base qwen2.5-1.5B model) evaluates these against each other in a round-robin tournament, creating soft rewards that help the model learn which arguments work better.

Github Code (new branch): github.com/brendanhogan/D…
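The GRPO step described above can be sketched as follows — a pure-Python stand-in (the real project does this over token log-probs in PyTorch): the tournament's soft rewards are normalized within the sampled group, and each completion's log-probability is weighted by its advantage.

```python
# Group-relative advantage weighting (toy GRPO objective).
def grpo_loss(logprobs: list[float], rewards: list[float]) -> float:
    """Negative advantage-weighted log-likelihood over one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # maximize advantage-weighted log-likelihood == minimize its negative
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(logprobs)

# three debate completions on one topic, with soft tournament rewards
loss = grpo_loss(logprobs=[-1.0, -2.0, -3.0], rewards=[1.0, 0.5, 0.0])
```

Because advantages are computed relative to the group's own mean, no absolute reward scale or hand-crafted rubric is needed — the judge only has to rank responses against each other.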