🎄 Advent of Small ML: Day 7 🎄 Topic: Entropy-Based Rewards (Forcing the model to "keep its options open")
there’s a fascinating recent paper (Layer by Layer: Uncovering Hidden Representations in Language Models - arxiv.org/abs/2502.02013 - shown to me by @aditjain1980) showing that reasoning models tend to have higher entropy in their middle layers
basically, instead of collapsing to an answer early, they keep more possibilities "alive" in their hidden states while thinking.
it made me think - if high entropy correlates with better reasoning, can we force the model to reason better by explicitly rewarding high entropy?
so I added a Matrix-based Entropy reward (Rényi entropy over the eigenvalues of the hidden-state Gram matrix) to GRPO training on the MATH500 dataset, rewarding the entropy of the middle 10 layers of qwen 2.5 7b
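for reference, the reward computation looks roughly like this - a minimal sketch of matrix-based entropy from the eigenvalues of the token Gram matrix; the layer band, α, and per-sequence shapes here are illustrative assumptions, not my exact config:

```python
import torch

def matrix_renyi_entropy(hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Matrix-based Renyi entropy of one layer's hidden states.

    hidden: (seq_len, d_model) for a single sequence. The eigenvalues of the
    trace-normalized Gram matrix act as a probability distribution; alpha -> 1
    recovers the von Neumann / Shannon case.
    """
    hidden = hidden.float()
    gram = hidden @ hidden.T                      # token-token Gram matrix
    gram = gram / gram.trace()                    # eigenvalues now sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp(min=1e-12)
    if abs(alpha - 1.0) < 1e-6:
        return -(eigvals * eigvals.log()).sum()   # Shannon limit
    return torch.log((eigvals ** alpha).sum()) / (1.0 - alpha)

def middle_layer_entropy_reward(hidden_states, mid_layers=range(9, 19), alpha=1.0):
    """Average the entropy over a 10-layer band in the middle of the network.

    hidden_states: one (seq_len, d_model) tensor per layer, e.g. from
    output_hidden_states=True with the batch dim squeezed out. Indices 9-18
    are just a plausible middle band for a ~28-layer model.
    """
    ents = [matrix_renyi_entropy(hidden_states[l], alpha) for l in mid_layers]
    return torch.stack(ents).mean()
```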
the initial results were mixed.
when I just rewarded entropy, the model definitely increased its entropy... but it didn't get better at math. It just learned to be "confused" and exploratory without actually converging on answers.
It produced some pretty funny outputs, going on weird tangents and "overthinking" simple problems (examples below)
But then I changed the reward rule: only reward high entropy if the final answer is CORRECT.
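in pseudocode the gate is trivial (base_reward and entropy_coef here are placeholder values, not what I actually used):

```python
def gated_entropy_reward(is_correct: bool, entropy: float,
                         base_reward: float = 1.0, entropy_coef: float = 0.1) -> float:
    """Entropy only counts when the answer is right."""
    if not is_correct:
        return 0.0                               # wrong answer: no credit for "exploring"
    return base_reward + entropy_coef * entropy  # correct answer: small bonus for keeping options open
```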
this worked (sort of) - it gave a 2.5% performance boost over the baseline.
this is a proof of concept that we can use RL to shape the internal dynamics of how a model thinks, not just its final output tokens.
🎄 Advent of Small ML: Day 3 Topic: Adversarial Unsupervised GRPO (Automated Red Teaming) 🎄
yesterday, I showed how to train a vlm without labels using a cyclegan-ish style loop. today I wanted to expand on that and make it harder/better
instead of training on random images, can we have an active adversary that hunts for the model's blind spots?
the hypothesis: if we train the model against an adversary that generates "hard" images, the model should become more robust and generalize better than just seeing random data.
the experiment: I set up a competitive game (gan-style) between two models:
the base model: tries to describe images so they can be recreated (reward = high cosine similarity) (same as yesterday)
the adversary: tries to generate prompts for images that the base model fails to describe well (reward = low cosine similarity).
basically, the adversary acts as an automated red team, constantly searching for the base model's weaknesses.
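the scoring is just the same similarity with opposite signs - a rough sketch, assuming the DINO embeddings of the original and regenerated images are already computed:

```python
import torch.nn.functional as F

# one round of the game: the adversary writes an image prompt, the image model
# renders it, the base model describes that image, the image model re-renders
# from the description, and DINO embeds both images. The two players then
# score the same number with opposite signs.

def describer_reward(orig_emb, regen_emb):
    """Base model: high similarity = faithful description."""
    return F.cosine_similarity(orig_emb, regen_emb, dim=-1)

def adversary_reward(orig_emb, regen_emb):
    """Adversary: it gets paid when the base model's description falls apart."""
    return -F.cosine_similarity(orig_emb, regen_emb, dim=-1)
```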
it actually beat the non-adversarial baseline from yesterday in the early stages, though they eventually converged to similar levels.
🎄 Advent of Small ML: Day 2 : Teaching a VLM to reason about charts with Unsupervised GRPO🎄
a big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: Can we do this in an unsupervised way?
If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.
The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b’s SVG work - go check it out).
The Experiment: I set up a loop to treat the VLM like an autoencoder:
1. Take a chart image.
2. Prompt the VLM to describe it.
3. Feed that description into an image generator (Flux Schnell).
4. Measure the cosine similarity between the regenerated image and the original (using DINO)
This similarity score becomes the reward signal for GRPO. The logic: to accurately recreate the image, the model must extract the most salient features in its description.
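concretely, each rollout's reward looks roughly like this (a sketch: the dinov2-base checkpoint and the generate_image callable are stand-ins for whatever you actually wire up to Flux Schnell):

```python
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

# DINOv2 as the semantic judge
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_embed(image):
    """CLS embedding of a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    return dino(**inputs).last_hidden_state[:, 0]        # (1, d)

def cycle_reward(original_chart, description, generate_image):
    """Reward for one GRPO rollout: how well the description recreates the chart.

    generate_image: any callable that turns the description back into an image
    (Flux Schnell in my runs) - treated as a black box here.
    """
    regenerated = generate_image(description)
    sim = F.cosine_similarity(dino_embed(original_chart),
                              dino_embed(regenerated), dim=-1)
    return sim.item()
```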
The methods: I used Qwen 2.5 3B and DINOv2 for the embeddings (to capture semantic info, not just pixels).
Results for the Proxy Task: The model consistently improved its cosine similarity scores.
Results for Transfer Learning: Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing a ~7% improvement in pass@1 at the peak.
It’s a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.
I’m really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.
Results: on the evaluation set, the cosine similarity between the original image and the regenerated one (from the VLM description sent to flux-schnell) keeps climbing - it is definitely learning!
just pushed my first multi-turn RL environment to @PrimeIntellect
the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).
its only tool: agentic RAG search over the story.
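the environment is roughly this shape - a sketch of the idea only, not the actual PrimeIntellect interface; the chunking and BM25 retriever are illustrative choices:

```python
from rank_bm25 import BM25Okapi

class StorySearchTool:
    """The model's only window into the story: keyword search over chunks."""

    def __init__(self, story_text: str, chunk_size: int = 300):
        words = story_text.split()
        self.chunks = [" ".join(words[i:i + chunk_size])
                       for i in range(0, len(words), chunk_size)]
        self.index = BM25Okapi([c.lower().split() for c in self.chunks])

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k chunks that best match the model's query."""
        scores = self.index.get_scores(query.lower().split())
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [self.chunks[i] for i in top]

# episode reward: did the model pick QuALITY's gold option after its searches?
def answer_reward(chosen_option: int, gold_option: int) -> float:
    return float(chosen_option == gold_option)
```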
this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
i really like this dataset; it's made up of harder-to-read short stories, and the questions really require (imo) a good and subtle understanding of the story.
introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5b, 3b, 7b, 14b & 32b) for a niche financial programming language called Q
Again, I really like this idea - for most practical agentic work I have done, you almost always just want to use a big API model: it works the best and is the quickest way to get a good prototype
the predicted context embedding is fed into the frozen network, which then samples reasoning chains as normal; those chains get scored, and the gradient is computed in the usual way
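roughly, the mechanics look like this - a sketch under my own assumptions about shapes and the RL bookkeeping (the predictor itself and the GRPO grouping are omitted):

```python
import torch

def predictor_loss(frozen_lm, context_embedding, sampled_ids, advantage):
    """REINFORCE-style loss for the network that predicts the context embedding.

    context_embedding: trainable output of the predictor, (1, k, d_model)
    sampled_ids:       a reasoning chain already sampled from the frozen LM
                       conditioned on that embedding, (1, T)
    advantage:         scalar score for that chain (e.g. group-normalized reward)

    The frozen LM's weights never get updated, but the log-prob of the sampled
    chain is differentiable w.r.t. the prepended embedding, so the gradient
    flows back to the predictor that produced it.
    """
    tok_emb = frozen_lm.get_input_embeddings()(sampled_ids)       # (1, T, d)
    inputs = torch.cat([context_embedding, tok_emb], dim=1)       # prepend context
    logits = frozen_lm(inputs_embeds=inputs).logits
    k = context_embedding.size(1)
    logp = (logits[:, k - 1:-1, :].log_softmax(-1)                # positions that predict
            .gather(-1, sampled_ids.unsqueeze(-1))                # each sampled token
            .squeeze(-1).sum())
    return -(advantage * logp)
```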