Hugh Zhang
research @scale_AI. co-created @gradientpub.
Sep 23, 2024
OpenAI recently released the o1 family of models and a graph showing scaling laws for test-time compute — sadly without the x-axis labeled.

Using only the public o1-mini API, I tried to reconstruct the graph as closely as possible. Original on left, my best attempt on right.
[Two images: OpenAI's original test-time compute scaling graph (left) and the reconstructed version (right)]
The OpenAI API does not allow you to easily control how many tokens to spend at test-time. I hack my way around this by telling o1-mini how long I want it to think for. Afterwards, I can figure out how many tokens were actually used based on how much the query cost!
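A minimal sketch of what that workaround could look like with the standard OpenAI Python client. The prompt wording, the `ask_with_thinking_budget` helper, and the example question are illustrative assumptions; the thread backs out token counts from the billed query cost rather than reading them from the usage field.

```python
from openai import OpenAI

client = OpenAI()

def ask_with_thinking_budget(question: str, target_tokens: int) -> tuple[str, int]:
    """Ask o1-mini a question while requesting (in plain English) a rough
    amount of thinking, then report how many completion tokens were spent.

    Hypothetical helper for illustration; the instruction phrasing and the
    use of usage.completion_tokens are assumptions, not the thread's exact method.
    """
    prompt = (
        f"Please think for roughly {target_tokens} tokens before answering.\n\n"
        f"{question}"
    )
    resp = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # completion_tokens covers both the hidden reasoning and the visible answer,
    # so it serves as a rough proxy for test-time compute actually used.
    spent = resp.usage.completion_tokens
    return resp.choices[0].message.content, spent


answer, tokens_used = ask_with_thinking_budget(
    "What is the sum of the first 100 primes?", target_tokens=4000
)
print(tokens_used, "completion tokens actually used")
```

Sweeping the requested thinking budget over several values and plotting accuracy against the tokens actually spent is the kind of curve the reconstruction relies on.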
Sep 6, 2024
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.

@evanzwangg @summeryue0 @squeakymouse777 @ellev3n11 @SeanHendryx PlanSearch significantly outperforms baselines on three popular code benchmarks (HumanEval+, MBPP+, and LiveCodeBench, a contamination-free benchmark for competitive coding) across all models considered.
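A rough sketch of the plan-then-code search loop the thread describes. The `generate` helper, both prompt templates, and the placeholder model name are assumptions for illustration, not the paper's exact pipeline.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model; any chat model works for this sketch

def generate(prompt: str, n: int = 1, temperature: float = 1.0) -> list[str]:
    """Sample n completions for a prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=temperature,
    )
    return [c.message.content for c in resp.choices]

def plan_search(problem: str, n_plans: int = 8) -> list[str]:
    """Search over natural-language plans first, then turn each plan into code.

    The prompts below are illustrative stand-ins, not the paper's.
    """
    # Step 1: sample diverse high-level plans in natural language.
    plans = generate(
        f"Propose a distinct high-level plan (no code) for solving:\n{problem}",
        n=n_plans,
        temperature=1.0,
    )
    # Step 2: generate one program per plan at low temperature.
    programs = []
    for plan in plans:
        code = generate(
            f"Problem:\n{problem}\n\nFollow this plan and write a Python solution:\n{plan}",
            n=1,
            temperature=0.2,
        )[0]
        programs.append(code)
    return programs  # in practice, filter or rank these with unit tests
```

The design point is that diversity is injected at the plan level in natural language; code is only generated conditioned on each plan, and the resulting candidates can be filtered with whatever tests are available.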
May 2, 2024
Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k.

Stepping back for a moment, LLM evals are really hard because LLMs themselves are trained on basically the entire Internet at this point, so any public benchmark you make will inevitably just end up in the LLM training set.
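One way to picture the overfitting measurement: compare a model's accuracy on the public GSM8k test set against its accuracy on the freshly written held-out set. The file names and result format below are hypothetical stand-ins, not Scale's actual evaluation harness.

```python
import json

def accuracy(results_path: str) -> float:
    """Fraction of problems answered correctly, given per-problem results.

    Expects a JSON list of {"correct": bool} records; this format is a
    hypothetical stand-in for whatever the real eval harness emits.
    """
    with open(results_path) as f:
        results = json.load(f)
    return sum(r["correct"] for r in results) / len(results)

# Overfitting shows up as a gap: high accuracy on the public GSM8k test set
# but noticeably lower accuracy on the new, never-published test set.
public = accuracy("gsm8k_public_results.json")          # hypothetical path
held_out = accuracy("gsm8k_new_heldout_results.json")   # hypothetical path
print(f"public GSM8k: {public:.1%}  new set: {held_out:.1%}  gap: {public - held_out:+.1%}")
```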