Brendan Hogan
Dec 7
🎄 Advent of Small ML: Day 7 🎄 Topic: Entropy-Based Rewards (Forcing the model to "keep its options open")

there’s a fascinating recent paper (Layer by Layer: Uncovering Hidden Representations in Language Models - arxiv.org/abs/2502.02013 - shown to me by @aditjain1980) showing that reasoning models tend to have higher entropy in their middle layers

basically, instead of collapsing to an answer early, they keep more possibilities "alive" in their hidden states while thinking.

it made me think - if high entropy correlates with better reasoning, can we force the model to reason better by explicitly rewarding high entropy?

so I added a matrix-based entropy reward (Rényi entropy over the eigenvalues of the hidden-state Gram matrix) to GRPO training on the MATH500 dataset, rewarding entropy across the middle 10 layers of qwen 2.5 7b
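for concreteness, here's a minimal sketch of that reward - this is the general matrix-based entropy recipe, not the exact code from the repo; the α order, the normalization, and which layers count as the "middle 10" are my assumptions:

```python
import torch
import torch.nn.functional as F

def matrix_renyi_entropy(hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Matrix-based Rényi entropy of one layer's hidden states.
    hidden: (num_tokens, d_model) activations for a single sequence at one layer."""
    z = F.normalize(hidden.float(), dim=-1)        # unit-normalize each token vector
    gram = z @ z.T                                 # (N, N) Gram matrix
    gram = gram / gram.trace()                     # eigenvalues now sum to 1
    eig = torch.linalg.eigvalsh(gram).clamp(min=1e-12)
    if abs(alpha - 1.0) < 1e-6:                    # alpha -> 1 is the von Neumann / Shannon limit
        return -(eig * eig.log()).sum()
    return torch.log((eig ** alpha).sum()) / (1.0 - alpha)

def middle_layer_entropy(hidden_states, k: int = 10) -> float:
    """Average entropy over the k middle layers (hidden_states: list of (N, d) tensors)."""
    mid = len(hidden_states) // 2
    layers = hidden_states[mid - k // 2 : mid + k // 2]
    return float(sum(matrix_renyi_entropy(h) for h in layers) / k)
```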

the initial results were mixed.

when I just rewarded entropy, the model definitely increased its entropy... but it didn't get better at math. It just learned to be "confused" and exploratory without actually converging on answers.

It produced some pretty funny outputs, going on weird tangents and "overthinking" simple problems (examples below)

But then I changed the rewarding rule: Only reward high entropy if the final answer is CORRECT.

this worked (sort of) - it gave a 2.5% performance boost over the baseline.
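the gating itself is trivial - roughly this (a sketch; the exact-match correctness check and the entropy weight are placeholders, the repo has the real version):

```python
def gated_entropy_reward(pred: str, gold: str, entropy: float,
                         entropy_coef: float = 0.1) -> float:
    """Correctness is the base reward; the entropy bonus only kicks in when correct,
    so the model can't farm reward by just being maximally 'confused'."""
    correct = float(pred.strip() == gold.strip())   # real runs need a math-aware checker
    return correct + correct * entropy_coef * entropy
```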

this is a proof of concept that we can use RL to shape the internal dynamics of how a model thinks, not just its final output tokens.

Repo + Plots below
Entropy results - entropy of the 10 middle layers throughout training [plot]
Pass@1 on the MATH500 eval set - it's a very minor gain, but the peak score is higher when entropy is rewarded [plot]
qualitatively I love the reasoning traces from the higher-entropy models - feels very refreshing / not normal llm speak - some examples below [screenshots]

More from @brendanh0gan

Dec 3
🎄 Advent of Small ML: Day 3 Topic: Adversarial Unsupervised GRPO (Automated Red Teaming) 🎄

yesterday, I showed how to train a vlm without labels using a cyclegan-ish style loop. today I wanted to expand on that and make it harder/better

instead of training on random images, can we have an active adversary that hunts for the model's blind spots?

the hypothesis: if we train the model against an adversary that generates "hard" images, the model should become more robust and generalize better than just seeing random data.

the experiment: I set up a competitive game (gan-style) between two models:

the base model: tries to describe images so they can be recreated (reward = high cosine similarity) (same as yesterday)

the adversary: tries to generate prompts for images that the base model fails to describe well (reward = low cosine similarity).

basically, the adversary acts as an automated red team, constantly searching for the base model's weaknesses.
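the two reward functions are basically mirror images of each other - a sketch (the embeddings are DINO features of the original and regenerated images; the function names here are mine, not the repo's):

```python
import torch.nn.functional as F

def describer_reward(orig_emb, regen_emb) -> float:
    """Base model: rewarded when its description lets the image be recreated faithfully."""
    return F.cosine_similarity(orig_emb, regen_emb, dim=-1).item()

def adversary_reward(orig_emb, regen_emb) -> float:
    """Adversary: rewarded when the describer fails, i.e. 1 minus the describer's reward."""
    return 1.0 - describer_reward(orig_emb, regen_emb)
```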

it actually beat the non-adversarial baseline from yesterday in the early stages, though they eventually converged to similar levels.

Repo + Plots + more results below
This is the adversary's reward throughout training (1 minus the base model's cosine sim) - mostly stable [plot]
Dec 2
🎄 Advent of Small ML: Day 2 : Teaching a VLM to reason about charts with Unsupervised GRPO🎄

a big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: Can we do this in an unsupervised way?

If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.

The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b’s SVG work - go check it out).

The Experiment: I set up a loop to treat the VLM like an autoencoder:

1. Take a chart image.

2. Prompt the VLM to describe it.

3. Feed that description into an image generator (Flux Schnell).

4. Measure the cosine similarity between the regenerated image and the original (using DINO)

This similarity score becomes the reward signal for GRPO. The logic: to accurately recreate the image, the model must extract the most salient features in its description.

The methods: I used Qwen 2.5 3B and DINOv2 for the embeddings (to capture semantic info, not just pixels).
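roughly, the whole reward loop looks like this - a sketch under my own assumptions (`generate_image` is a placeholder wrapping Flux Schnell, and the DINOv2 image preprocessing is omitted), not the repo's actual API:

```python
import torch
import torch.nn.functional as F

# DINOv2 backbone from torch hub; called on a preprocessed (1, 3, H, W) tensor it
# returns a global image embedding we can compare with cosine similarity
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

def autoencoder_reward(chart_img: torch.Tensor, description: str, generate_image) -> float:
    """Reward = cosine similarity between the original chart and the chart
    regenerated from the VLM's description (generate_image wraps Flux Schnell)."""
    regen_img = generate_image(description)                     # text -> image tensor
    with torch.no_grad():
        sim = F.cosine_similarity(dino(chart_img), dino(regen_img), dim=-1)
    return sim.item()
```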

Results for the Proxy Task: The model consistently improved its cosine similarity scores.

Results for Transfer Learning: Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing an ~7% improvement in pass@1 at the peak.

It’s a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.

I’m really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.

Repo + Plots in the comments.
Results: on the evaluation set, the cosine similarity between the regenerated image (from the VLM's description sent to flux-schnell) and the original keeps climbing - it is definitely learning! [plot]
Aug 24
just pushed my first multi-turn RL environment to @PrimeIntellect

the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).

its only tool: agentic RAG search over the story.
this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
i really like this dataset; it’s sort of harder-to-read short stories, and the questions really require (imo) a good and subtle understanding of the story.
Aug 13
introducing qqWen: our fully open-sourced project (code+weights+data+detailed technical report) for full-stack finetuning (pretrain+SFT+RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q

All details below!
Links:

Technical Report: arxiv.org/abs/2508.06813

Models +Data on HuggingFace: huggingface.co/collections/mo…

Full Code: github.com/morganstanley/…
Note for Q Practitioners:

our SFT dataset/benchmark is made from leetcode problems, which might not reflect how Q is really used.

for general Q purposes, the pretrained models might be better than the fully fine-tuned ones
Jul 11
doing this now for my debate framework: gpt4.1 vs gpt4.1 advised by qwen 3B

gpt4.1 w/ qwen's advice debates itself in elo/tournament style to get an advantage signal

advantage is used to grpo qwen to give better advice

you can fine-tune api models with rl'd context
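the outer loop, very roughly (every helper here is a hypothetical placeholder, not the framework's real API - the point is just that the tournament score becomes the group-normalized GRPO advantage for the advice tokens):

```python
def advice_advantages(question, advisor, debater, run_tournament, n: int = 8):
    """Sample n pieces of advice from the small model, score each by how well the
    API model does in the debate tournament when given that advice, then
    group-normalize the scores into GRPO advantages."""
    advices = [advisor.sample(question) for _ in range(n)]
    scores = [run_tournament(debater, question, adv) for adv in advices]  # e.g. elo / win rate
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5 + 1e-6
    return advices, [(s - mean) / std for s in scores]
```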
Again I really like this idea - for most practical agentic work I have done, you almost always just want to use a big API model - it works the best, and is quickest to get a good prototype

and training a big model is often infeasible
Jul 3
other idea - if you assume it’s an open-weights model, can you learn an embedding-space context/prompt that improves performance?

I use/train a simple 3-layer network: it maps the last embedding of the prompt to a new embedding, which is then fed into the frozen LLM
the predicted context embedding is fed into the frozen network, which (with sampling) generates reasoning chains as normal, which then get scored, and the gradient is computed in the normal way
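a minimal sketch of that predictor - the 3-layer shape is from the thread, everything else (widths, and feeding the predicted embedding in via inputs_embeds) is my assumption:

```python
import torch
import torch.nn as nn

class ContextPredictor(nn.Module):
    """Maps the prompt's last embedding to one extra 'soft context' embedding
    that gets prepended to the frozen LLM's input embeddings."""
    def __init__(self, d_model: int, d_hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, last_prompt_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(last_prompt_embedding)               # (batch, d_model)

# usage sketch with a frozen HF model: embed the prompt, predict the extra context
# embedding, prepend it, then sample reasoning chains and score them for the RL update
# prompt_embeds = frozen_model.get_input_embeddings()(input_ids)        # (B, T, d)
# ctx = predictor(prompt_embeds[:, -1, :]).unsqueeze(1)                 # (B, 1, d)
# out = frozen_model.generate(inputs_embeds=torch.cat([ctx, prompt_embeds], dim=1))
```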