🎄 Advent of Small ML: Day 7 🎄 Topic: Entropy-Based Rewards (Forcing the model to "keep its options open")
there’s a fascinating recent paper (Layer by Layer: Uncovering Hidden Representations in Language Models - arxiv.org/abs/2502.02013 - shown to me by @aditjain1980) showing that reasoning models tend to have higher entropy in their middle layers
basically, instead of collapsing to an answer early, they keep more possibilities "alive" in their hidden states while thinking.
it made me think - if high entropy correlates with better reasoning, can we force the model to reason better by explicitly rewarding high entropy?
so I added a Matrix-based Entropy reward (Rényi entropy over the eigenvalues of the hidden-state Gram matrix) to GRPO training on the MATH500 dataset, rewarding the entropy of the middle 10 layers of qwen 2.5 7b
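for reference, the reward computation looks roughly like this - a minimal sketch of matrix-based entropy from the eigenvalues of the token Gram matrix; the layer band, α, and per-sequence shapes here are illustrative assumptions, not my exact config:

```python
import torch

def matrix_renyi_entropy(hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Matrix-based Renyi entropy of one layer's hidden states.

    hidden: (seq_len, d_model) for a single sequence. The eigenvalues of the
    trace-normalized Gram matrix act as a probability distribution; alpha -> 1
    recovers the von Neumann / Shannon case.
    """
    hidden = hidden.float()
    gram = hidden @ hidden.T                      # token-token Gram matrix
    gram = gram / gram.trace()                    # eigenvalues now sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp(min=1e-12)
    if abs(alpha - 1.0) < 1e-6:
        return -(eigvals * eigvals.log()).sum()   # Shannon limit
    return torch.log((eigvals ** alpha).sum()) / (1.0 - alpha)

def middle_layer_entropy_reward(hidden_states, mid_layers=range(9, 19), alpha=1.0):
    """Average the entropy over a 10-layer band in the middle of the network.

    hidden_states: one (seq_len, d_model) tensor per layer, e.g. from
    output_hidden_states=True with the batch dim squeezed out. Indices 9-18
    are just a plausible middle band for a ~28-layer model.
    """
    ents = [matrix_renyi_entropy(hidden_states[l], alpha) for l in mid_layers]
    return torch.stack(ents).mean()
```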
the initial results were mixed.
when I just rewarded entropy, the model definitely increased its entropy... but it didn't get better at math. It just learned to be "confused" and exploratory without actually converging on answers.
It produced some pretty funny outputs, going on weird tangents and "overthinking" simple problems (examples below)
But then I changed the reward rule: only reward high entropy if the final answer is CORRECT.
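in pseudocode the gate is trivial (base_reward and entropy_coef here are placeholder values, not what I actually used):

```python
def gated_entropy_reward(is_correct: bool, entropy: float,
                         base_reward: float = 1.0, entropy_coef: float = 0.1) -> float:
    """Entropy only counts when the answer is right."""
    if not is_correct:
        return 0.0                               # wrong answer: no credit for "exploring"
    return base_reward + entropy_coef * entropy  # correct answer: small bonus for keeping options open
```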
this worked (sort of) - it gave a 2.5% performance boost over the baseline.
this is a proof of concept that we can use RL to shape the internal dynamics of how a model thinks, not just its final output tokens.
🎄 Advent of Small ML: Day 3 Topic: Adversarial Unsupervised GRPO (Automated Red Teaming) 🎄
yesterday, I showed how to train a vlm without labels using a cyclegan-ish style loop. today I wanted to expand on that and make it harder/better
instead of training on random images, can we have an active adversary that hunts for the model's blind spots?
the hypothesis: if we train the model against an adversary that generates "hard" images, the model should become more robust and generalize better than just seeing random data.
the experiment: I set up a competitive game (gan-style) between two models:
the base model: tries to describe images so they can be recreated (reward = high cosine similarity) (same as yesterday)
the adversary: tries to generate prompts for images that the base model fails to describe well (reward = low cosine similarity).
basically, the adversary acts as an automated red team, constantly searching for the base model's weaknesses.
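the scoring is just the same similarity with opposite signs - a rough sketch, assuming the DINO embeddings of the original and regenerated images are already computed:

```python
import torch.nn.functional as F

# one round of the game: the adversary writes an image prompt, the image model
# renders it, the base model describes that image, the image model re-renders
# from the description, and DINO embeds both images. The two players then
# score the same number with opposite signs.

def describer_reward(orig_emb, regen_emb):
    """Base model: high similarity = faithful description."""
    return F.cosine_similarity(orig_emb, regen_emb, dim=-1)

def adversary_reward(orig_emb, regen_emb):
    """Adversary: it gets paid when the base model's description falls apart."""
    return -F.cosine_similarity(orig_emb, regen_emb, dim=-1)
```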
it actually beat the non-adversarial baseline from yesterday in the early stages, though they eventually converged to similar levels.
🎄 Advent of Small ML: Day 2 : Teaching a VLM to reason about charts with Unsupervised GRPO🎄
a big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: Can we do this in an unsupervised way?
If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.
The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b’s SVG work - go check it out).
The Experiment: I set up a loop to treat the VLM like an autoencoder:
1. Take a chart image.
2. Prompt the VLM to describe it.
3. Feed that description into an image generator (Flux Schnell).
4. Measure the cosine similarity between the regenerated image and the original (using DINO)
This similarity score becomes the reward signal for GRPO. The logic: to accurately recreate the image, the model must extract the most salient features in its description.
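concretely, each rollout's reward looks roughly like this (a sketch: the dinov2-base checkpoint and the generate_image callable are stand-ins for whatever you actually wire up to Flux Schnell):

```python
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

# DINOv2 as the semantic judge
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_embed(image):
    """CLS embedding of a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    return dino(**inputs).last_hidden_state[:, 0]        # (1, d)

def cycle_reward(original_chart, description, generate_image):
    """Reward for one GRPO rollout: how well the description recreates the chart.

    generate_image: any callable that turns the description back into an image
    (Flux Schnell in my runs) - treated as a black box here.
    """
    regenerated = generate_image(description)
    sim = F.cosine_similarity(dino_embed(original_chart),
                              dino_embed(regenerated), dim=-1)
    return sim.item()
```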
The methods: I used Qwen 2.5 3B and DINOv2 for the embeddings (to capture semantic info, not just pixels).
Results for the Proxy Task: The model consistently improved its cosine similarity scores.
Results for Transfer Learning: Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing a ~7% improvement in pass@1 at the peak.
It’s a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.
I’m really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.
Results: on the evaluation set, the cosine similarity between the original image and the regenerated one (from the VLM description sent to flux-schnell) keeps climbing - it is definitely learning!
just pushed my first multi-turn RL environment to @PrimeIntellect
the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).
its only tool: agentic RAG search over the story.
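the environment is roughly this shape - a sketch of the idea only, not the actual PrimeIntellect interface; the chunking and BM25 retriever are illustrative choices:

```python
from rank_bm25 import BM25Okapi

class StorySearchTool:
    """The model's only window into the story: keyword search over chunks."""

    def __init__(self, story_text: str, chunk_size: int = 300):
        words = story_text.split()
        self.chunks = [" ".join(words[i:i + chunk_size])
                       for i in range(0, len(words), chunk_size)]
        self.index = BM25Okapi([c.lower().split() for c in self.chunks])

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k chunks that best match the model's query."""
        scores = self.index.get_scores(query.lower().split())
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [self.chunks[i] for i in top]

# episode reward: did the model pick QuALITY's gold option after its searches?
def answer_reward(chosen_option: int, gold_option: int) -> float:
    return float(chosen_option == gold_option)
```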
this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
i really like this dataset; it's made up of harder-to-read short stories, and the questions really require (imo) a good and subtle understanding of the story.
introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q
Again, I really like this idea - for most practical agentic work I have done, you almost always just want to use a big API model: it works the best and is the quickest way to get a good prototype
the predicted context embedding is fed into the frozen network, which then samples reasoning chains as normal; those chains get scored, and the gradient is computed in the usual way
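roughly, the mechanics look like this - a sketch under my own assumptions about shapes and the RL bookkeeping (the predictor itself and the GRPO grouping are omitted):

```python
import torch

def predictor_loss(frozen_lm, context_embedding, sampled_ids, advantage):
    """REINFORCE-style loss for the network that predicts the context embedding.

    context_embedding: trainable output of the predictor, (1, k, d_model)
    sampled_ids:       a reasoning chain already sampled from the frozen LM
                       conditioned on that embedding, (1, T)
    advantage:         scalar score for that chain (e.g. group-normalized reward)

    The frozen LM's weights never get updated, but the log-prob of the sampled
    chain is differentiable w.r.t. the prepended embedding, so the gradient
    flows back to the predictor that produced it.
    """
    tok_emb = frozen_lm.get_input_embeddings()(sampled_ids)       # (1, T, d)
    inputs = torch.cat([context_embedding, tok_emb], dim=1)       # prepend context
    logits = frozen_lm(inputs_embeds=inputs).logits
    k = context_embedding.size(1)
    logp = (logits[:, k - 1:-1, :].log_softmax(-1)                # positions that predict
            .gather(-1, sampled_ids.unsqueeze(-1))                # each sampled token
            .squeeze(-1).sum())
    return -(advantage * logp)
```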