We’ve been cooking... a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.
Despite its size, it performs at or above larger open-weight (QwQ-32B, R1-70B, R1) and closed (o1-mini, Sonnet 3.7) models on math benchmarks like AIME/HMMT/OmniMath.
From what I saw, the model performs above the very strong O(32B) Qwen 3 models released yesterday on AIME and GPQA. Haven't fully tested against them yet, but we will!
OK, first are the vibe checks:
The model is small enough to run on your (beefy) laptop, but capable enough to solve many of my favorite riddles that larger non-reasoning (and some reasoning) models can't solve. It passed the DimitrisEval!
2/n (old but still relevant slides)
Large bsize is good for system reasons:
=> the speedup over 1 worker/thread/GPU becomes more and more aligned with linear as bsize increases
should be relatively obvious why
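Not from the slides, but here's a minimal single-GPU proxy for that point (the MLP, sizes, and bsize grid are my own arbitrary choices): larger batches keep the device busier, so per-example time drops and the speedup over a single worker gets closer to linear.

```python
# Toy sketch: measure forward+backward throughput of a small MLP at different
# batch sizes. Model width, bsize grid, and iteration count are illustrative.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for bsize in [32, 128, 512, 2048]:
    x = torch.randn(bsize, 1024, device=device)
    y = torch.randint(0, 10, (bsize,), device=device)
    # warm-up step so one-time setup / kernel selection isn't timed
    loss_fn(model(x), y).backward(); opt.step(); opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    print(f"bsize={bsize:5d}  examples/sec={20 * bsize / dt:,.0f}")
```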
3/n but increasing bsize has a non-monotonic effect on convergence, i.e., the number of iterations to reach a given accuracy
Note: this is an old figure (tensorflow, boo gen Z), but the phenomenon was/is pretty universal across most supervised classification settings we tried.
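For concreteness, here's a toy sweep that measures the same quantity as the figure's y-axis, iterations to a fixed accuracy as a function of bsize. The setup (sklearn digits, a linear classifier, a 90% train-accuracy target) is my own illustrative choice, not the one behind the original plot.

```python
# Toy sketch: count SGD iterations needed to hit a fixed train accuracy
# as batch size grows. Dataset, model, lr, and target are assumptions.
import torch
import torch.nn as nn
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X = torch.tensor(X, dtype=torch.float32) / 16.0
y = torch.tensor(y)

def iters_to_accuracy(bsize, target=0.90, max_iters=20000, lr=0.5, seed=0):
    torch.manual_seed(seed)
    model = nn.Linear(X.shape[1], 10)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for it in range(1, max_iters + 1):
        idx = torch.randint(0, len(X), (bsize,))
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad(); loss.backward(); opt.step()
        if it % 50 == 0:  # periodically check full-train accuracy
            acc = (model(X).argmax(dim=1) == y).float().mean().item()
            if acc >= target:
                return it
    return max_iters  # target never reached within the budget

for bsize in [8, 32, 128, 512]:
    print(f"bsize={bsize:4d}  iterations to 90% train acc: {iters_to_accuracy(bsize)}")
```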
I tried 14 of the multimodal reasoning examples from the @GoogleDeepMind Gemini paper on @OpenAI's GPT-4 (with vision). I didn't even transcribe the prompts, I just pasted the images of the prompts.
GPT-4 gets ~12/14 right.
14-part boring thread.
Example 1: Verifying a student’s solution to a physics problem.
GPT-4 gets the same answer as Gemini.
Example 2: inverse graphics. GPT-4 is not quite there, but close; I'll give it 0.5 points for the effort and the bad JPEG it had to read.
2/ LLMs, when trained on vast amounts of data, eventually learn basic arithmetic (add/mul, etc.) up to a certain digit length. That is *surprising*!! These tasks are not explicitly encoded in the next-word prediction loss.
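To see why it's surprising, here's a hypothetical toy setup (mine, not the paper's): addition examples are just character strings, and the model is trained with the ordinary next-token cross-entropy loss. Nothing in the objective mentions arithmetic.

```python
# Toy illustration: "23+58=81" is treated as plain text and the model only
# ever sees a next-character prediction loss. Tiny GRU LM is an assumption
# for compactness, not the architecture studied.
import torch
import torch.nn as nn

vocab = list("0123456789+=$")            # "$" marks end of sequence
stoi = {ch: i for i, ch in enumerate(vocab)}

def encode(a: int, b: int) -> torch.Tensor:
    return torch.tensor([stoi[c] for c in f"{a}+{b}={a+b}$"])

class TinyLM(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, len(vocab))
    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)

model = TinyLM()
seq = encode(23, 58).unsqueeze(0)        # "23+58=81$"
logits = model(seq[:, :-1])              # predict each next character
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), seq[:, 1:].reshape(-1)
)
print(loss.item())                       # standard next-token loss, no arithmetic in sight
```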
3/ How does GPT-3 learn to add? Prior research has delved into the emergence of these capabilities as a function of resource (parameter/data) scale, but untangling the factors that elicit them quickly remains challenging due to data complexity and the variety of tasks examined.