I tested Phi-4-reasoning on my early-grad linear algebra (private) final exam at UW-Madison. It scored 100% on the first run.
Two years ago I speculated that nothing useful could run locally anytime soon. I was wrong. Kids can now have a free, grad-level TA running on their PC.
Being exposed to the reasoning trace is also incredibly useful for understanding problem-solving approaches. I'm a bit mind-blown.
@vtripolitakis @geeknik Electronics, god dammit
@vtripolitakis @geeknik The fact that I don't remember you is maybe a good sign! The worst TAs were also the most memorable. MHL was at the top, far above Electronics :D
@vtripolitakis @geeknik NIGHTMARES ABOUT PNEVMATIKATOS'S BREADBOARDS lol
We’ve been cooking... a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.
Despite its size, it performs at or above larger open weight (QwQ-32B, R1-70B, R1) and closed (o1-mini, sonnet 3.7) models on math benchmarks like AIME/HMMT/OmniMath.
From what I saw, the model performs above the very strong O(32B) Qwen 3 models (released yesterday) on AIME and GPQA. Haven't fully tested against them yet, but we will!
OK, first are the vibe checks:
The model is small enough to run on your (beefy) laptop, but capable enough to solve many of my favorite riddles that larger non-reasoning (and some reasoning) models can't solve. It passed the DimitrisEval!
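If you want to poke at it locally, here's roughly what that looks like with Hugging Face transformers; the repo id and generation settings below are my placeholders, not an official recipe.

```python
# Minimal local-inference sketch. The repo id "microsoft/Phi-4-reasoning"
# and the generation settings are assumptions/placeholders, not official.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed Hub id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # needs `accelerate`
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
# keep special tokens so the reasoning-trace delimiters stay visible
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=False))
```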
2/n (old but still relevant slides)
Large bsize is good for systems reasons:
=> the speedup over 1 worker/thread/GPU gets closer and closer to linear as bsize increases.
Should be relatively obvious why.
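Here's a toy throughput model of why (made-up numbers, purely to illustrate the amortization argument): each data-parallel step pays a roughly fixed sync/communication cost, and a bigger bsize amortizes that fixed cost over more compute.

```python
# Toy model of data-parallel step time (all numbers made up):
# per-step time ~= compute on bsize/W examples + a fixed sync/comm overhead.
# As bsize grows, the fixed overhead is amortized and the speedup over
# 1 worker approaches the linear ideal of W.

def step_time(bsize, workers, per_example=1.0, overhead=50.0):
    return (bsize / workers) * per_example + overhead

W = 8
for bsize in [64, 256, 1024, 4096, 16384]:
    speedup = step_time(bsize, 1) / step_time(bsize, W)
    print(f"bsize={bsize:6d}  speedup with {W} workers: {speedup:.2f}x (ideal {W}x)")
```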
3/n but increasing bsize has a non-monotonic effect on convergence, i.e., the number of iterations to reach a given accuracy
Note: this is an old figure (tensorflow, boo gen Z), but the phenomenon was/is pretty universal across most supervised classification settings we tried.
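If you want to see this for yourself, here's a sketch of the measurement (PyTorch rather than the old tensorflow code; toy logistic regression, and every hyperparameter is an illustrative choice of mine): sweep bsize and count SGD steps until a target train accuracy.

```python
# Sketch: count SGD iterations to a target accuracy as a function of bsize.
# Toy separable logistic-regression data; hyperparameters are illustrative.
import torch

torch.manual_seed(0)
n, d = 20_000, 50
X = torch.randn(n, d)
y = (X @ torch.randn(d) > 0).float()  # labels from a random linear rule

def steps_to_accuracy(bsize, target=0.95, lr=0.5, max_steps=20_000):
    model = torch.nn.Linear(d, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for step in range(1, max_steps + 1):
        idx = torch.randint(0, n, (bsize,))
        loss = loss_fn(model(X[idx]).squeeze(-1), y[idx])
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:  # check full-train accuracy periodically
            with torch.no_grad():
                acc = ((model(X).squeeze(-1) > 0).float() == y).float().mean()
            if acc.item() >= target:
                return step
    return max_steps

for bsize in [8, 32, 128, 512, 2048]:
    print(f"bsize={bsize:5d}  steps to 95% train acc: {steps_to_accuracy(bsize)}")
```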
I tried 14 of the multimodal reasoning examples from the @GoogleDeepMind Gemini paper on @OpenAI's ChatGPT-4 (with vision). I didn't even transcribe the prompts; I just pasted the images of the prompts.
GPT-4 gets ~12/14 right.
14-part boring thread.
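(If you'd rather reproduce this programmatically than paste screenshots into the UI like I did, here's a sketch against the OpenAI Python API; the model name and file path are placeholders.)

```python
# Sketch of the same experiment via the API instead of the ChatGPT UI.
# "gpt-4o" and the screenshot filename are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("gemini_example_01.png", "rb") as f:  # screenshot of the prompt
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer the question shown in the image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```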
Example 1: Verifying a student’s solution to a physics problem.
GPT-4 gets the same answer as Gemini.
Example 2: inverse graphics. GPT-4 is not quite there, but close; I'll give it 0.5 points for the effort and the bad JPEG it had to read.
2/ LLMs, when trained on vast amounts of data, eventually learn basic arithmetic (add/mul, etc.), up to a certain digit length. That is *surprising*!! These tasks are not explicitly encoded in the next-word prediction loss (toy sketch below).
3/ How does GPT-3 learn to add? Prior research has delved into the emergence of these capabilities as a function of resource (parameter/data) scale, but untangling the factors that elicit them remains challenging, due to data complexity and the variety of tasks examined.
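To make the "not explicitly encoded in the loss" point concrete, here's a toy sketch (my illustrative setup, not any paper's): a tiny char-level model trained with plain next-token cross-entropy on strings like "12+34=46"; nothing in the objective mentions addition.

```python
# Toy sketch: next-token prediction on addition strings. The loss is plain
# cross-entropy over characters; "addition" never appears in the objective.
# Tiny model and hyperparameters are illustrative.
import random
import torch
import torch.nn as nn

random.seed(0); torch.manual_seed(0)
vocab = list("0123456789+=$")              # '$' marks end of string
stoi = {c: i for i, c in enumerate(vocab)}

def sample():                              # e.g. "12+34=46$"
    a, b = random.randint(0, 99), random.randint(0, 99)
    return f"{a}+{b}={a + b}$"

class CharModel(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, len(vocab))
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

model = CharModel()
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2001):
    batch = [sample() for _ in range(64)]
    maxlen = max(len(s) for s in batch)
    x = torch.full((64, maxlen), stoi["$"])        # pad with end token
    for i, s in enumerate(batch):
        x[i, : len(s)] = torch.tensor([stoi[c] for c in s])
    logits = model(x[:, :-1])                      # predict char t+1 from prefix
    loss = loss_fn(logits.reshape(-1, len(vocab)), x[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  next-char loss {loss.item():.3f}")
```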