Brendan Hogan
Aug 24
just pushed my first multi-turn RL environment to @PrimeIntellect

the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).

its only tool: agentic RAG search over the story.
this is an idea I have been toying with for a while but never got around to until now. I had a paper last year on a twist on a RAG method, which primarily experimented on this dataset.
i really like this dataset; it’s sort of harder-to-read short stories, and the questions really require (imo) a good and subtle understanding of the story.
so I liked the idea of building an agentic RAG system over this dataset - each story gets chunked up and embedded using OpenAI’s embeddings - then the agent gets to choose the query to embed and search.
the chunks are very small, so it’s a pretty difficult task. But I think learning here would require really reasoning about the question and the structure of this kind of writing.
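roughly, the search tool could look like the sketch below. this is my own sketch, not the environment's actual code: the chunk size, embedding model name, and helper names are all assumptions.

```python
# Sketch of an agentic RAG search tool over a QuALITY story:
# chunk the story, embed chunks with OpenAI embeddings, and let the
# agent pick the query string. Chunk size and model are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk_story(text: str, size: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def build_index(story: str):
    chunks = chunk_story(story)
    vecs = embed(chunks)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    return chunks, vecs

def search(query: str, chunks: list[str], vecs: np.ndarray, k: int = 3) -> list[str]:
    """The agent's tool: return the k chunks most similar to its query."""
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(vecs @ q)[::-1][:k]  # highest cosine similarity first
    return [chunks[i] for i in top]
```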
thanks to @PrimeIntellect for building this and @willccbb for the invite! I think this is such an incredible initiative that can go in so many exciting directions. Looking forward to publishing a lot more RL environments and building agi together :) !
Original QuALITY paper: arxiv.org/abs/2112.08608

My earlier RAG paper: arxiv.org/abs/2409.15566

More from @brendanh0gan

Aug 13
introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for Q, a niche financial programming language

All details below!
Links:

Technical Report: arxiv.org/abs/2508.06813

Models + Data on Hugging Face: huggingface.co/collections/mo…

Full Code: github.com/morganstanley/…
Note for Q Practitioners:

our SFT dataset/benchmark is made from leetcode problems, which might not reflect how Q is really used.

for general Q purposes, the pretrained models might be better than the fully fine-tuned ones
Jul 11
doing this now for my debate framework: gpt4.1 vs gpt4.1 advised by qwen 3B

gpt4.1 with qwen's advice debates itself in elo/tournament style to get an advantage

the advantage is used to grpo qwen into giving better advice

you can fine-tune api models with rl'd context
Again I really like this idea - for most practical agentic work I have done, you almost always just want to use a big API model - it works best and is the quickest way to a good prototype

and training a big model is often infeasible
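as a sketch of the loop as I read it (every callable here - sample_advice, run_debate, judge - is a hypothetical stand-in, not the actual framework code):

```python
# Schematic GRPO loop for training the small advisor model: sample a
# group of advice strings, have the advised API model debate the plain
# one, and turn win rates into group-relative advantages.
import numpy as np

def grpo_advantages(question, sample_advice, run_debate, judge,
                    n_samples=8, n_debates=4):
    # sample a group of advice strings from the small (trainable) model
    advices = [sample_advice(question) for _ in range(n_samples)]
    wins = np.zeros(n_samples)
    for i, advice in enumerate(advices):
        for _ in range(n_debates):
            # gpt4.1 armed with the advice debates plain gpt4.1;
            # an LLM judge decides who won the transcript
            transcript = run_debate(question, advice)
            wins[i] += (judge(transcript) == "advised")
    win_rate = wins / n_debates
    # group-relative advantage: each advice sample is scored by how much
    # it beat the group mean (the usual GRPO normalization); this signal
    # updates qwen, never the API model
    return (win_rate - win_rate.mean()) / (win_rate.std() + 1e-6)
```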
Jul 3
other idea - if you assume it’s an open-weights model, can you learn an embedding-space context/prompt that improves performance?

I use/train a simple 3-layer network: it predicts, from the last embedding of the prompt, a new embedding which is then fed into the frozen LLM
the predicted context embedding is fed into the frozen network, which (with sampling) generates reasoning chains as normal, which then get scored, and the gradient is computed in the normal way
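a minimal sketch of what I understand the wiring to be - layer sizes and the exact placement of the predicted embedding are my assumptions, not the thread's actual code:

```python
# A 3-layer MLP maps the prompt's last input embedding to one new
# "context" embedding, which is appended and fed to the frozen LLM.
import torch
import torch.nn as nn

class ContextPredictor(nn.Module):
    """3-layer MLP: last prompt embedding -> one new context embedding."""
    def __init__(self, d_model: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, last_prompt_emb: torch.Tensor) -> torch.Tensor:
        return self.net(last_prompt_emb)

def forward_with_context(frozen_llm, input_ids, predictor):
    embeds = frozen_llm.get_input_embeddings()(input_ids)   # (B, T, D)
    ctx = predictor(embeds[:, -1, :]).unsqueeze(1)          # (B, 1, D)
    # feed [prompt; predicted context] into the frozen model, then sample
    # reasoning chains as usual; only the predictor receives gradients
    return frozen_llm(inputs_embeds=torch.cat([embeds, ctx], dim=1))
```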
May 23
introducing: picoDeepResearch

multi-turn tool use + soft rewards + self-play + GRPO

You define the arena (report prompts + judging principles)

the model generates reports, uses tools (web search), then competes in round-robin battles judged by an LLM

winner gets the gradient
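the round-robin scoring could look something like this sketch - judge_prefers is a hypothetical callable (an LLM judge that picks a winner per pair), and the repo may structure this differently:

```python
# Every report battles every other report; a report's reward is its
# win rate across all pairings, as decided by the LLM judge.
from itertools import combinations

def round_robin_rewards(reports, judge_prefers, principles):
    wins = [0] * len(reports)
    for i, j in combinations(range(len(reports)), 2):
        # judge returns 0 or 1: which of the two reports better
        # satisfies the user-defined judging principles
        winner = judge_prefers(reports[i], reports[j], principles)
        wins[i if winner == 0 else j] += 1
    n_opponents = len(reports) - 1
    return [w / n_opponents for w in wins]
```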
Code: github.com/brendanhogan/p…

all still just pytorch, no vLLM/TRL/etc

inspired by OpenAI’s Deep Research, but made “pico”: just enough to run real experiments, fine-tune real models, and build intuition

these results were using Qwen3-14B
im particularly excited about this project - the other ones felt fun but exploratory - this one feels like pulling everything together into a single framework to produce an end model that could be really useful.
May 15
new project - training a VLM to solve CAPTCHAs with rl (grpo on F1 score).

introduced a “tool” for click_screen(x, y).

dataset is from cityscapes, F1 goes from 0.11 to ~0.78. details below
I built the dataset by taking Cityscapes + segmentation masks, gridding each image, and labeling any square as positive if >10% of its pixels were cars or motorcycles.
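the labeling rule sketched out - grid size is an assumption, and the class ids are Cityscapes conventions (26 = car, 32 = motorcycle); the F1 reward would then compare the model's clicked cells against these labels:

```python
# Turn a Cityscapes segmentation mask into CAPTCHA-style grid labels:
# a cell is positive if >10% of its pixels are car or motorcycle.
import numpy as np

CAR, MOTORCYCLE = 26, 32  # Cityscapes label ids

def grid_labels(mask: np.ndarray, grid: int = 8) -> np.ndarray:
    h, w = mask.shape
    ch, cw = h // grid, w // grid
    labels = np.zeros((grid, grid), dtype=bool)
    for r in range(grid):
        for c in range(grid):
            cell = mask[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            frac = np.isin(cell, [CAR, MOTORCYCLE]).mean()
            labels[r, c] = frac > 0.10
    return labels
```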
Apr 27
i added a basic implementation of deepseek’s grm/spct paper to the debate framework - just many rounds of principles/critiques for the scoring

similar early win rate vs gpt-4o-mini. and anecdotally, the arguments read much better and feel less reward-hacky to me. gh below
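a loose sketch of the "many rounds of principles/critiques" scoring as I read it - the prompts and the llm() callable are hypothetical placeholders, not the repo's code:

```python
# GRM/SPCT-style scoring: each round generates fresh judging
# principles, critiques the transcript against them, then emits a
# numeric score; averaging rounds gives a lower-variance reward.
def grm_score(llm, transcript: str, n_rounds: int = 4) -> float:
    scores = []
    for _ in range(n_rounds):
        principles = llm(f"Write judging principles for this debate:\n{transcript}")
        critique = llm(f"Principles:\n{principles}\n\nCritique this debate:\n{transcript}")
        # assumes llm() returns just a number for this prompt
        score = llm(f"Given the critique:\n{critique}\n\nScore the debate 0-10. Reply with only a number.")
        scores.append(float(score))
    return sum(scores) / len(scores)
```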
Github: github.com/brendanhogan/D…

this code is very much a work in progress - it's pretty hard coded for the debate framework rn
also a lot left to do experimentally - including just letting the first run play out to 300+ steps to see what happens