Hugh Zhang
research @scale_AI. co-created @gradientpub.
Sep 23, 2024
OpenAI recently released the o1 family of models and a graph showing scaling laws for test-time compute — sadly without the x-axis labeled.

Using only the public o1-mini API, I tried to reconstruct the graph as closely as possible. Original on left, my best attempt on right.
[Two images: OpenAI's original test-time compute scaling graph (left) and the reconstructed version (right)]
The OpenAI API does not allow you to easily control how many tokens to spend at test-time. I hack my way around this by telling o1-mini how long I want it to think for. Afterwards, I can figure out how many tokens were actually used based on how much the query cost!
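A minimal sketch of what that workaround could look like with the standard OpenAI Python client. The prompt wording, the `ask_with_thinking_budget` helper, and the example question are illustrative assumptions; the thread backs out token counts from the billed query cost rather than reading them from the usage field.

```python
from openai import OpenAI

client = OpenAI()

def ask_with_thinking_budget(question: str, target_tokens: int) -> tuple[str, int]:
    """Ask o1-mini a question while requesting (in plain English) a rough
    amount of thinking, then report how many completion tokens were spent.

    Hypothetical helper for illustration; the instruction phrasing and the
    use of usage.completion_tokens are assumptions, not the thread's exact method.
    """
    prompt = (
        f"Please think for roughly {target_tokens} tokens before answering.\n\n"
        f"{question}"
    )
    resp = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # completion_tokens covers both the hidden reasoning and the visible answer,
    # so it serves as a rough proxy for test-time compute actually used.
    spent = resp.usage.completion_tokens
    return resp.choices[0].message.content, spent


answer, tokens_used = ask_with_thinking_budget(
    "What is the sum of the first 100 primes?", target_tokens=4000
)
print(tokens_used, "completion tokens actually used")
```

Sweeping the requested thinking budget over several values and plotting accuracy against the tokens actually spent is the kind of curve the reconstruction relies on.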
Sep 6, 2024
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.

@evanzwangg @summeryue0 @squeakymouse777 @ellev3n11 @SeanHendryx PlanSearch significantly outperforms baselines on three popular code benchmarks (HumanEval+, MBPP+, and LiveCodeBench, a contamination-free benchmark for competitive coding) across all models considered.
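A rough sketch of the plan-then-code search loop the thread describes. The `generate` helper, both prompt templates, and the placeholder model name are assumptions for illustration, not the paper's exact pipeline.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model; any chat model works for this sketch

def generate(prompt: str, n: int = 1, temperature: float = 1.0) -> list[str]:
    """Sample n completions for a prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=temperature,
    )
    return [c.message.content for c in resp.choices]

def plan_search(problem: str, n_plans: int = 8) -> list[str]:
    """Search over natural-language plans first, then turn each plan into code.

    The prompts below are illustrative stand-ins, not the paper's.
    """
    # Step 1: sample diverse high-level plans in natural language.
    plans = generate(
        f"Propose a distinct high-level plan (no code) for solving:\n{problem}",
        n=n_plans,
        temperature=1.0,
    )
    # Step 2: generate one program per plan at low temperature.
    programs = []
    for plan in plans:
        code = generate(
            f"Problem:\n{problem}\n\nFollow this plan and write a Python solution:\n{plan}",
            n=1,
            temperature=0.2,
        )[0]
        programs.append(code)
    return programs  # in practice, filter or rank these with unit tests
```

The design point is that diversity is injected at the plan level in natural language; code is only generated conditioned on each plan, and the resulting candidates can be filtered with whatever tests are available.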
May 2, 2024
Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k.

Stepping back for a moment, LLM evals are really hard because LLMs themselves are trained on basically the entire Internet at this point, so any public benchmark you make will inevitably just end up in the LLM training set.
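One way to picture the overfitting measurement: compare a model's accuracy on the public GSM8k test set against its accuracy on the freshly written held-out set. The file names and result format below are hypothetical stand-ins, not Scale's actual evaluation harness.

```python
import json

def accuracy(results_path: str) -> float:
    """Fraction of problems answered correctly, given per-problem results.

    Expects a JSON list of {"correct": bool} records; this format is a
    hypothetical stand-in for whatever the real eval harness emits.
    """
    with open(results_path) as f:
        results = json.load(f)
    return sum(r["correct"] for r in results) / len(results)

# Overfitting shows up as a gap: high accuracy on the public GSM8k test set
# but noticeably lower accuracy on the new, never-published test set.
public = accuracy("gsm8k_public_results.json")          # hypothetical path
held_out = accuracy("gsm8k_new_heldout_results.json")   # hypothetical path
print(f"public GSM8k: {public:.1%}  new set: {held_out:.1%}  gap: {public - held_out:+.1%}")
```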