Brendan Hogan
ml research scientist @morganstanley || phd in cs @cornell 2024
Dec 18, 2025
🎄 Advent of Small ML: Day 18 🎄 Topic: GRPO Training with 1 Million Persona Judges (Optimizing for Your Audience)

yesterday i showed how we can simulate 1M personas to "poll" the country. today i wanted to close the loop: what if we use those personas as the judge in a GRPO training loop?

the idea is simple: instead of training a model for generic "quality" (which usually just means "what an RLHF rater likes"), we can train it to specifically resonate with a targeted slice of the population.

so i took the simulation engine from yesterday and turned it into a reward function.

the model generates 4 tweets about "The Future of Work"

A jury of 50 personas (filtered to a specific demographic) votes in a round-robin tournament

Win rate = Reward Signal for GRPO
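
the reward plumbing is simple enough to sketch. here's a rough version - judge_vote is a hypothetical helper that prompts a persona-conditioned LLM with two tweets and returns which one that persona prefers; names and signatures are mine, not the repo's:

```python
import itertools
from typing import Callable, Sequence

def round_robin_win_rates(
    completions: Sequence[str],
    jury: Sequence[dict],
    judge_vote: Callable[[dict, str, str], int],
) -> list[float]:
    """Pit every pair of completions against each other; each persona in the
    jury votes for the tweet it prefers. A completion's reward is its overall
    win rate across all of its matchups."""
    wins = [0] * len(completions)
    total = [0] * len(completions)
    for i, j in itertools.combinations(range(len(completions)), 2):
        for persona in jury:
            # judge_vote returns 0 if the persona prefers completions[i], 1 otherwise
            winner = i if judge_vote(persona, completions[i], completions[j]) == 0 else j
            wins[winner] += 1
            total[i] += 1
            total[j] += 1
    return [w / t if t else 0.0 for w, t in zip(wins, total)]

# usage: rewards = round_robin_win_rates(group_of_4_tweets, jury_of_50, judge_vote)
```

GRPO then normalizes these win rates within the group of 4, so a tweet only gets positive advantage if it out-polls its siblings.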

for this run, i set the target demographic to "Young Professionals (18-29) in Coastal Cities (NY, CA)".

the result? you can watch the model learn to optimize its messaging for that demographic

it started out losing to GPT-4.1, but after ~150 steps of GRPO it learned the specific tone/framing that group likes, hitting a 62% win rate against GPT-4.1 within that demographic

i updated the dashboard from yesterday so you can visualize the training run (video and explanation below)

you can scrub through the training steps and watch the map turn "blue" (meaning our model wins) specifically in the target states

it’s a cool proof of concept for "Demographic Alignment" - optimizing models not just for "humans" broadly but for specific communities, by using a specific demographic as the judge you optimize for

code + demo in comments

code: github.com/brendanhogan/2…
Dec 16, 2025
🎄 Advent of Small ML: Day 16 🎄 Topic: ENGRAM (Skill → Cartridge) for Wiki Search (Continual Learning for a multi-turn tool use environment)

huge thank you to @willccbb and @PrimeIntellect for building the wiki environment, verifiers and the environments hub - it makes it super easy to try out all kinds of ideas like this in a controllable, repeatable and measurable way!

Environment:

the environment works like this: the LLM is presented with a trivia question that can be answered from a wikipedia page, along with a corpus of wikipedia pages (and their embeddings in a ChromaDB database)

the llm has three tools - search_pages, view_sections, read_section. It has to learn strategies: when to search broadly vs. specifically, how to navigate page structure, and when to stop - so as to best answer its question

the LLM's answer is then scored using llm-as-a-judge
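
for intuition, here's roughly how i picture the three tools sitting on top of the ChromaDB index - the PARSED_PAGES dict and the metadata layout are placeholders of mine, the real tool schemas live in the verifiers wiki environment:

```python
import chromadb

client = chromadb.Client()
pages = client.get_or_create_collection("wiki_pages")  # assumes pages were added with metadatas={"title": ...}

# placeholder: title -> {section header -> section text}, filled when the corpus is parsed
PARSED_PAGES: dict[str, dict[str, str]] = {}

def search_pages(query: str, k: int = 5) -> list[str]:
    """Embedding search over the page collection; returns candidate page titles."""
    res = pages.query(query_texts=[query], n_results=k)
    return [m["title"] for m in res["metadatas"][0]]

def view_sections(title: str) -> list[str]:
    """List the section headers of one page, so the model can decide what to read."""
    return list(PARSED_PAGES[title].keys())

def read_section(title: str, section: str) -> str:
    """Return the raw text of one section for the model to read."""
    return PARSED_PAGES[title][section]
```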

Method:

ENGRAM: I use the same "Conscious Practice → Muscle Memory" loop:

Phase A (Skill): The agent tries to solve questions. I use the Prime Intellect verifiers library to judge the answers (GPT-4.1). Based on feedback, I then update a text-based "Strategy Guide."

Phase B (Cartridge): Every N steps, i distill that text guide into a compressed Cartridge (KV cache vectors).

Phase C: Reset the guide, keep the cartridge.
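
to make phase B concrete, here's a minimal sketch of the consumption side, using a plain (uncompressed) KV prefix via HF transformers as a stand-in for the distilled cartridge - the actual distillation/compression step is more involved and lives in the repo:

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def build_cartridge(strategy_guide: str) -> DynamicCache:
    """Prefill the text strategy guide once and keep its KV cache as the 'cartridge'."""
    ids = tok(strategy_guide, return_tensors="pt").to(model.device)
    return model(**ids, past_key_values=DynamicCache(), use_cache=True).past_key_values

@torch.no_grad()
def answer_with_cartridge(strategy_guide: str, question: str, cartridge: DynamicCache) -> str:
    """Generate an answer; tokens already covered by the cache are not recomputed.
    Note: the full text must tokenize to the same prefix the cache was built from."""
    full = tok(strategy_guide + "\n\n" + question, return_tensors="pt").to(model.device)
    out = model.generate(**full, past_key_values=copy.deepcopy(cartridge), max_new_tokens=128)
    return tok.decode(out[0, full["input_ids"].shape[1]:], skip_special_tokens=True)
```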

Results:

On a small test set, the model started at 20% accuracy (it didn't know how to use the tools effectively). After the skill refinement and cartridge distillation loop, it peaked at 40% accuracy (full results below)

definitely a small test - but it successfully encoded "search strategies" into a compressed vector format that persists without fine-tuning.

repo + results + skill example below

code: github.com/brendanhogan/2…

wiki environment: app.primeintellect.ai/dashboard/envi…
Dec 7, 2025
🎄 Advent of Small ML: Day 7 🎄 Topic: Entropy-Based Rewards (Forcing the model to "keep its options open")

there’s a fascinating recent paper (Layer by Layer: Uncovering Hidden Representations in Language Models - arxiv.org/abs/2502.02013 - shown to me by @aditjain1980) showing that reasoning models tend to have higher entropy in their middle layers

basically, instead of collapsing to an answer early, they keep more possibilities "alive" in their hidden states while thinking.

it made me think - if high entropy correlates with better reasoning, can we force the model to reason better by explicitly rewarding high entropy?

so I added a matrix-based entropy reward (Rényi entropy over the eigenvalues of the hidden-state Gram matrix) to GRPO training on the MATH500 dataset, rewarding the entropy of the middle 10 layers of qwen 2.5 7b
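
for reference, matrix-based Rényi entropy is usually computed from the eigenvalues of the trace-normalized Gram matrix of a layer's hidden states - something like the sketch below. the exact alpha, normalization, and layer weighting i used live in the repo; treat this as the general recipe, not the exact reward:

```python
import torch

def matrix_renyi_entropy(hidden: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Matrix-based Renyi entropy of one layer's hidden states.

    hidden: (seq_len, d) activations for a single sequence at a single layer.
    Builds the Gram matrix, normalizes its eigenvalues to sum to 1, and computes
    S_alpha = 1/(1-alpha) * log(sum_i lambda_i^alpha)."""
    h = hidden.float()
    gram = h @ h.T                                        # (seq_len, seq_len)
    gram = gram / gram.diagonal().sum().clamp_min(1e-8)   # trace-normalize
    eigvals = torch.linalg.eigvalsh(gram).clamp_min(1e-12)
    if abs(alpha - 1.0) < 1e-6:                           # von Neumann / Shannon limit
        return -(eigvals * eigvals.log()).sum()
    return eigvals.pow(alpha).sum().log() / (1.0 - alpha)

def middle_layer_entropy(hidden_states: tuple, alpha: float = 2.0) -> float:
    """Average entropy over the middle 10 layers (hidden_states as returned by a
    HF forward pass with output_hidden_states=True, batch size 1)."""
    n = len(hidden_states)
    mid = range(n // 2 - 5, n // 2 + 5)
    vals = [matrix_renyi_entropy(hidden_states[i][0], alpha) for i in mid]
    return torch.stack(vals).mean().item()
```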

the initial results were mixed.

when I just rewarded entropy, the model definitely increased its entropy... but it didn't get better at math. It just learned to be "confused" and exploratory without actually converging on answers.

It produced some pretty funny outputs, going on weird tangents and "overthinking" simple problems (examples below)

But then I changed the reward rule: only reward high entropy if the final answer is CORRECT.
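
the gate itself is tiny - something like this, with the base/coef weights as placeholders:

```python
def gated_entropy_reward(answer_correct: bool, entropy: float,
                         base: float = 1.0, coef: float = 0.1) -> float:
    """Only pay the entropy bonus when the final answer is right; wrong answers
    get nothing, so the model can't farm reward by just being 'confused'."""
    if not answer_correct:
        return 0.0
    return base + coef * entropy
```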

this worked (sort of) - it gave a 2.5% performance boost over the baseline.

this is a proof of concept that we can use RL to shape the internal dynamics of how a model thinks, not just its final output tokens.

Repo + Plots below

Code: github.com/brendanhogan/2…
Dec 3, 2025
🎄 Advent of Small ML: Day 3 Topic: Adversarial Unsupervised GRPO (Automated Red Teaming) 🎄

yesterday, I showed how to train a VLM without labels using a CycleGAN-ish loop. today I wanted to expand on that and make it harder/better

instead of training on random images, can we have an active adversary that hunts for the model's blind spots?

the hypothesis: if we train the model against an adversary that generates "hard" images, the model should become more robust and generalize better than just seeing random data.

the experiment: I set up a competitive game (gan-style) between two models:

the base model: tries to describe images so they can be recreated (reward = high cosine similarity) (same as yesterday)

the adversary: tries to generate prompts for images that the base model fails to describe well (reward = low cosine similarity).

basically, the adversary acts as an automated red team, constantly searching for the base model's weaknesses.
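
reward-wise, the two players see the same cosine similarity scores with opposite objectives - roughly:

```python
def zero_sum_rewards(cos_sims: list[float]) -> tuple[list[float], list[float]]:
    """Same DINO cosine similarities, opposite objectives: the describer wants
    them high, the adversary wants them low. GRPO normalizes each list within
    its own group before computing advantages."""
    describer_rewards = list(cos_sims)
    adversary_rewards = [1.0 - s for s in cos_sims]
    return describer_rewards, adversary_rewards
```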

the adversarially trained model actually beat the non-adversarial baseline from yesterday in the early stages, though the two eventually converged to similar levels.

Repo + Plots + more results below

repo: github.com/brendanhogan/2…
Dec 2, 2025
🎄 Advent of Small ML: Day 2: Teaching a VLM to reason about charts with Unsupervised GRPO 🎄

a big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: Can we do this in an unsupervised way?

If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.

The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b’s SVG work - go check it out).

The Experiment: I set up a loop to treat the VLM like an autoencoder:

1. Take a chart image.

2. Prompt the VLM to describe it.

3. Feed that description into an image generator (Flux Schnell).

4. Measure the cosine similarity between the regenerated image and the original (using DINO)

This similarity score becomes the reward signal for GRPO. The logic: to accurately recreate the image, the model must extract the most salient features in its description.

The methods: I used Qwen 2.5 3B and DINOv2 for the embeddings (to capture semantic info, not just pixels).
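
the reward function is basically "describe, regenerate, compare". a minimal sketch of how i'd wire it up - the exact Flux/DINOv2 checkpoints and preprocessing here are my assumptions, not necessarily what the repo uses:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from diffusers import FluxPipeline

flux = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell",
                                    torch_dtype=torch.bfloat16).to("cuda")
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to("cuda").eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)),   # 224 is divisible by DINOv2's patch size of 14
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(pil_image) -> torch.Tensor:
    """DINOv2 embedding of a PIL image."""
    return dino(prep(pil_image.convert("RGB")).unsqueeze(0).to("cuda"))

@torch.no_grad()
def cycle_reward(original_chart, description: str) -> float:
    """Reward = DINOv2 cosine similarity between the original chart and the
    image Flux regenerates from the VLM's description."""
    regen = flux(description, num_inference_steps=4, guidance_scale=0.0).images[0]
    return F.cosine_similarity(embed(original_chart), embed(regen)).item()
```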

Results for the Proxy Task: The model consistently improved its cosine similarity scores.

Results for Transfer Learning: Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing a ~7% improvement in pass@1 at the peak.

It’s a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.

I’m really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.

Repo + Plots in the comments.

Github: github.com/brendanhogan/2…
Aug 24, 2025
just pushed my first multi-turn RL environment to @PrimeIntellect

the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).

its only tool: agentic RAG search over the story.

this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
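
the search tool itself is just top-k retrieval over chunks of the story. a rough sketch (the chunking scheme and embedding model are my choices, not the environment's):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def build_index(story: str, chunk_size: int = 300):
    """Split the story into word-count chunks and embed each one."""
    words = story.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vecs

def search_story(query: str, chunks, vecs, k: int = 3) -> list[str]:
    """The single tool: return the k story chunks most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(vecs @ q))[:k]
    return [chunks[i] for i in top]
```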
Aug 13, 2025
introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q

All details below!
Links:

Technical Report: arxiv.org/abs/2508.06813

Models + Data on HuggingFace: huggingface.co/collections/mo…

Full Code: github.com/morganstanley/…
Jul 11, 2025
doing this now for my debate framework: gpt4.1 vs gpt4.1 advised by qwen 3B

gpt4.1 with qwen's advice debates itself in elo/tournament style to compute an advantage

that advantage is used to GRPO qwen to give better advice

you can fine-tune api models with RL'd context

code: github.com/brendanhogan/D…
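
collapsing the elo/tournament scoring down to a single win/loss, one GRPO group looks roughly like this - every callable here is a placeholder for the real pipeline in the repo:

```python
from typing import Callable

def advisor_step(
    topic: str,
    advise: Callable[[str], str],       # qwen 3B policy: topic -> advice text
    debate: Callable[[str, str], str],  # gpt-4.1 debates itself, one side seeing the advice -> transcript
    judge: Callable[[str], bool],       # did the advised side win?
    group_size: int = 8,
) -> tuple[list[str], list[float]]:
    """One GRPO group: sample several pieces of advice, score each by whether
    the advised copy of gpt-4.1 wins its debate, return (completions, rewards).
    GRPO normalizes the rewards within the group and updates only qwen."""
    advices = [advise(topic) for _ in range(group_size)]
    rewards = [1.0 if judge(debate(topic, a)) else 0.0 for a in advices]
    return advices, rewards
```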
Jul 3, 2025
other idea - if you assume it’s an open-weights model, can you learn an embedding-space context/prompt that improves performance?

I use/train a simple 3-layer network: it maps the last embedding of the prompt to a new embedding, which is then fed into the frozen LLM

code: github.com/brendanhogan/D…
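
the network itself is tiny. a sketch of how i'd write it - whether the learned embedding is appended or prepended, and how many of them there are, is a detail of the repo, not this sketch:

```python
import torch
import torch.nn as nn

class SoftPromptPredictor(nn.Module):
    """3-layer MLP: maps the last prompt token's embedding to one extra
    'virtual token' embedding that is fed into the frozen LLM."""
    def __init__(self, d_model: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq, d_model) from the frozen model's embedding table
        virtual = self.net(prompt_embeds[:, -1, :]).unsqueeze(1)   # (batch, 1, d_model)
        return torch.cat([prompt_embeds, virtual], dim=1)          # extra learned "token" at the end

# usage with a frozen HF model (names illustrative, attention mask needs one extra position):
#   embeds = model.get_input_embeddings()(input_ids)
#   out = model(inputs_embeds=predictor(embeds), attention_mask=extended_mask)
```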
May 23, 2025
introducing: picoDeepResearch

multi-turn tool use + soft rewards + self-play + GRPO

You define the arena (report prompts + judging principles)

the model generates reports, uses tools (web search), then competes in round-robin battles judged by an LLM

winner gets the gradient
Code:

all still just pytorch, no vLLM/TRL/etc

inspired by OpenAI’s Deep Research, but made “pico”, just enough to run real experiments, fine-tune real models, and build intuition

these results were using Qwen3-14B

github.com/brendanhogan/p…
May 15, 2025
new project - training a VLM to solve CAPTCHAs with RL (GRPO on F1 score).

introduced a “tool” for click_screen(x, y).

dataset is from Cityscapes, F1 goes from 0.11 to ~0.78. details below
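
the reward is just set-level F1 over which grid cells got clicked. a sketch, with the click parsing and the pixel→cell mapping as my own assumptions:

```python
import re

def parse_clicks(response: str) -> set[tuple[int, int]]:
    """Pull every click_screen(x, y) call out of the model's response."""
    return {(int(x), int(y)) for x, y in re.findall(r"click_screen\((\d+),\s*(\d+)\)", response)}

def to_cell(x: int, y: int, cell_px: int = 112) -> tuple[int, int]:
    """Map a pixel click onto the CAPTCHA grid cell it lands in (cell size is a placeholder)."""
    return (x // cell_px, y // cell_px)

def click_f1(response: str, target_cells: set[tuple[int, int]]) -> float:
    """F1 between the grid cells the model clicked and the cells containing the target object."""
    clicked = {to_cell(x, y) for x, y in parse_clicks(response)}
    if not clicked or not target_cells:
        return 0.0
    tp = len(clicked & target_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(clicked)
    recall = tp / len(target_cells)
    return 2 * precision * recall / (precision + recall)
```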

code: github.com/brendanhogan/D…
Apr 27, 2025
i added a basic implementation of deepseek’s grm/spct paper to the debate framework - just many rounds of principles/critiques for the scoring
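
roughly, the judge writes its own principles each round, critiques both sides against them, and the per-round scores get aggregated - something like this sketch, where judge is any text-in/text-out LLM call (the prompts and aggregation here are mine, not deepseek's exact recipe):

```python
import statistics
from typing import Callable

def grm_score(
    argument_a: str,
    argument_b: str,
    topic: str,
    judge: Callable[[str], str],   # any text-in/text-out LLM call
    rounds: int = 4,
) -> float:
    """Rough GRM/SPCT-style scoring: each round the judge writes principles,
    critiques both arguments against them, and emits a 0-10 score for A;
    scores are averaged across rounds."""
    scores = []
    for _ in range(rounds):
        principles = judge(f"Write 3 principles for judging a debate on: {topic}")
        critique = judge(
            f"Principles:\n{principles}\n\nArgument A:\n{argument_a}\n\nArgument B:\n{argument_b}\n\n"
            "Critique both arguments against each principle, then on the last line "
            "output only a score from 0 to 10 for Argument A."
        )
        try:
            scores.append(float(critique.strip().splitlines()[-1]))
        except ValueError:
            continue  # unparseable round, skip it
    return statistics.mean(scores) if scores else 5.0
```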

similar early win rate vs gpt-4o-mini. and anecdotally, the arguments read much better and are less reward-hacky to me. gh below
Github:

this code is very much a work in progress - it's pretty hard-coded for the debate framework rn

github.com/brendanhogan/D…
Apr 1, 2025
comedy has been achieved internally
qwen-2.5-7B is able to get the following win rate for jokes vs gpt-4o-mini
the prompt is roughly ‘generate a larry david style rant on {subject}’ and the judge determines who is funnier - more details and examples in comments

code available here with the dataset: i do think it's interesting that a 1.5B model couldn't ever win - whether the judge was itself or gpt-4o-mini

github.com/brendanhogan/D…
Mar 28, 2025
new project: teaching LLMs to debate through self-play!
Using R1-style GRPO with LLM-judged round-robin tournaments, qwen 2.5-1.5B learns to improve its arguments - going from winning 3% to 95% of debates against gpt-4o-mini. No hand-crafted rewards, just models learning from each other - code and more info below 🤖

how it works: During training, the model generates multiple debate responses on the same topic. A judge LLM (the base qwen2.5-1.5B model) evaluates these against each other in a round-robin tournament, creating soft rewards that help the model learn which arguments work better.

Github Code (new branch): github.com/brendanhogan/D…