Hugh Zhang · Sep 23 · 12 tweets
OpenAI recently released the o1 family of models and a graph showing scaling laws for test-time compute — sadly without the x-axis labeled.

Using only the public o1-mini API, I tried to reconstruct the graph as closely as possible. Original on left, my best attempt on right.
[Images: OpenAI's original test-time compute graph (left) and my reconstruction (right)]
The OpenAI API doesn't let you directly control how many tokens o1-mini spends at test time. I hack my way around this by telling o1-mini how long I want it to think for. Afterwards, I can figure out how many tokens were actually used from how much the query cost!
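Concretely, the hack looks something like the sketch below. This is a minimal version, not the repo's exact code: the helper name is mine, and the price and model name are just the ones quoted later in this thread.

```python
from openai import OpenAI

client = OpenAI()
PRICE_PER_OUTPUT_TOKEN = 12 / 1_000_000  # $12 per 1M output tokens, as quoted below

def ask_with_budget(problem: str, budget_tokens: int):
    """Ask o1-mini to 'think' for roughly budget_tokens, then recover actual usage."""
    resp = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user",
            "content": f"Think for roughly {budget_tokens} tokens before answering.\n\n{problem}",
        }],
    )
    # All output tokens (including hidden reasoning tokens) are billed, so the
    # query cost reveals how many tokens o1-mini actually spent.
    actual_tokens = resp.usage.completion_tokens
    cost = actual_tokens * PRICE_PER_OUTPUT_TOKEN
    return resp.choices[0].message.content, actual_tokens, cost
```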
Here’s a plot of how many tokens we ask o1-mini to think for against how many it actually uses. If you request a very small token budget, it often refuses to listen. The same goes for a very large token budget. But in the region between 2^4 and 2^11, it seems to work reasonably well.
When restricting to just that region of 2^4 (16) to 2^11 (2048), we get the following curve. Note that o1-mini doesn't really "listen" to the precise number of tokens we ask it to use. In fact, in this region, it seems to consistently use ~8 times as many tokens as we ask for!
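The sweep behind that plot can be approximated like this (hypothetical sketch; `ask_with_budget` is the helper above and `AIME_PROBLEMS` is a stand-in for the problem list):

```python
# Sweep requested budgets from 2^4 to 2^11 and compare against actual usage.
budgets = [2 ** k for k in range(4, 12)]
for budget in budgets:
    _, actual, _ = ask_with_budget(AIME_PROBLEMS[0], budget)
    print(f"requested={budget:5d}  actual={actual:6d}  ratio={actual / budget:.1f}")
# Per the plot above, the ratio hovers around ~8x throughout this region.
```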
This only gets us ~2^14 = 16K tokens spent at test-time. Despite fiddling with various prompts, I was unable to get the model to reliably "think" longer.

To scale further, I took a page from the self-consistency paper by doing repeated sampling and then taking a majority vote.
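A minimal sketch of that repeated-sampling + majority-vote step (again not the repo's exact code; `extract_answer` stands in for whatever parses the final AIME answer out of the model's response):

```python
from collections import Counter

def majority_vote(problem: str, n_samples: int, budget_tokens: int):
    """Sample o1-mini n_samples times and return the most common answer."""
    answers = []
    for _ in range(n_samples):
        text, _, _ = ask_with_budget(problem, budget_tokens)
        answers.append(extract_answer(text))  # e.g. the integer 0-999 for AIME
    # Self-consistency: the modal answer wins (ties broken arbitrarily by Counter).
    return Counter(answers).most_common(1)[0][0]
```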
One natural question when seeing scaling-law graphs is: how long does this trend continue? For the original pre-training scaling laws, each additional datapoint cost millions of dollars, so it took some time before anyone could extend the curve.
In this case, for scaling test-time compute, the reconstructed graph was surprisingly cheap to make: 2^17 tokens per problem * 30 AIME problems from 2024 ≈ 4M tokens. At $12 per 1M output tokens, the largest inference run cost only about $50. o1-mini doesn't cost that much!
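The back-of-the-envelope math, spelled out:

```python
tokens_per_problem = 2 ** 17      # ~131K output tokens per problem
n_problems = 30                   # AIME 2024
price_per_million_tokens = 12.0   # $12 per 1M output tokens
total_tokens = tokens_per_problem * n_problems                     # ~3.9M tokens
total_cost = total_tokens / 1_000_000 * price_per_million_tokens   # ~$47
```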
Sadly, self-consistency / majority vote doesn't seem to scale much past the initial gains. I increased the number of samples 16x beyond the most successful run, but saw no further gains past 2^17 total tokens: accuracy plateaus at ~70%, just under what the original OpenAI graph showed.
This is consistent with past work suggesting that majority voting saturates at some point (classic statistics says something similar too). So we won't be able to hack around actually getting the models to think longer if we want to keep scaling up test-time compute.
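A toy illustration of that saturation argument (not from the repo, purely to show the statistics): per problem, majority voting converges to whichever answer the model is most likely to produce, so more samples can't help on problems where the correct answer isn't the modal one.

```python
import random
from collections import Counter

def simulate(p_correct_is_mode: float, n_samples: int, n_trials: int = 2000) -> float:
    """Fraction of toy problems solved by majority vote over n_samples samples."""
    wins = 0
    for _ in range(n_trials):
        if random.random() < p_correct_is_mode:
            # Correct answer is modal: voting converges to the right answer.
            dist = {"correct": 0.6, "wrong_a": 0.2, "wrong_b": 0.2}
        else:
            # A wrong answer is modal: voting converges to the wrong answer.
            dist = {"correct": 0.3, "wrong_a": 0.5, "wrong_b": 0.2}
        samples = random.choices(list(dist), weights=list(dist.values()), k=n_samples)
        wins += Counter(samples).most_common(1)[0][0] == "correct"
    return wins / n_trials

for k in [1, 4, 16, 64, 256]:
    print(k, simulate(p_correct_is_mode=0.7, n_samples=k))
# Accuracy climbs with more samples but flattens near 0.7 -- the fraction of
# problems where the correct answer is the model's modal answer.
```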
Forcing the model to think longer in token space is likely more effective than repeated sampling + majority vote for scaling test-time compute. Given how cheap the replication was, I'm super curious to see how far this approach scales.
If you want to play with the data / code yourself, it’s all available below. All code written with help from @cursor_ai, Claude 3.5 Sonnet, and of course, o1.

github.com/hughbzhang/o1_…
Finally, I wanted to give a huge shout-out to @cHHillee, @jbfja, @ecjwg and Celia Chen for feedback / thoughts on initial versions of this!
