I tested phi-4-reasoning on my (private) early-grad linear algebra final exam at UW-Madison. It scored 100% on the first run.
Two years ago I speculated that nothing useful could run locally anytime soon. I was wrong. Kids can now have a free, grad-level TA running on their PC.
Being exposed to the reasoning trace is also incredibly useful for understanding problem-solving approaches. I'm a bit mind-blown.
@vtripolitakis @geeknik Electronics, god dammit
@vtripolitakis @geeknik The fact that I don't remember you is maybe a good sign! The worst TAs were also the most memorable. MHL was at the top, far above Electronics :D
@vtripolitakis @geeknik NIGHTMARES ABOUT PNEVMATIKATOS'S BREADBOARDS lol
1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?"
Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating!
We study why this 🔁 happens and why increasing temperature is only a band-aid.
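For context on the "band-aid": temperature just rescales the logits before sampling, injecting noise that can knock the model out of a repetition cycle without fixing why it entered one. A minimal sketch (my own illustration, not the paper's code):

```python
import numpy as np

def sample_with_temperature(logits, temp=1.0, rng=None):
    """Sample a token id from logits rescaled by temperature.

    temp -> 0 approaches greedy decoding (argmax); temp > 1 flattens
    the distribution. Raising temp adds randomness that can break a
    repetition cycle, but it doesn't fix the underlying error that
    caused the loop -- hence "band-aid".
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temp, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```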
2/ Looping in reasoning LLMs isn't an edge case!
On AIME, we flag a response as looping if a 30-gram repeats >20 times (a minimal sketch of this detector is below the list). Across open reasoning LLMs (Qwen, OpenThinker, Phi, Llama-R1) we see:
Low temps => more looping
Smaller models => more looping
Harder problems => more looping
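Here's roughly what that flag looks like in code. This is a sketch of the stated 30-gram / >20-repeat rule; counting overlapping windows over token ids is my assumption, and the paper's exact tokenization/windowing may differ:

```python
from collections import Counter

def flags_as_looping(token_ids, n=30, max_repeats=20):
    """Return True if any n-gram of tokens occurs more than
    max_repeats times in the response."""
    if len(token_ids) < n:
        return False
    counts = Counter(tuple(token_ids[i:i + n])
                     for i in range(len(token_ids) - n + 1))
    return max(counts.values()) > max_repeats
```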
3/ Most telling: distilled students loop far more than their teachers (e.g., 30% in OpenThinker-3-1.5B vs 4% in its teacher QwQ-32B).
The student-teacher gap points to errors in learning as a key cause: if the student had learned the teacher's distribution perfectly, it should not loop significantly more.
We’ve been cooking... a new open-weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.
Despite its size, it performs at or above larger open-weight (QwQ-32B, R1-70B, R1) and closed (o1-mini, Sonnet 3.7) models on math benchmarks like AIME/HMMT/OmniMath.
From what I've seen, it performs above the very strong O(32B) Qwen 3 models (released yesterday) on AIME and GPQA. Haven't fully tested against them yet, but we will!
OK, first are the vibe checks:
The model is small enough to run on your (beefy) laptop, but capable enough to solve many of my favorite riddles that larger non-reasoning (and some reasoning) models can't solve. It passed the DimitrisEval!
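If you want to try it locally, here's a minimal sketch using Hugging Face transformers. The repo id is my assumption; double-check the exact name on the Hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumption: verify on the Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Prove that the eigenvalues of a real symmetric matrix are real."
inputs = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=2048)
# Print only the newly generated tokens (the reasoning trace + answer)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```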
2/n (old but still relevant slides)
Large bsize is good for system reasons:
=> the speedup over 1 worker/thread/GPU gets closer and closer to linear as bsize increases
Should be relatively obvious why: fixed per-step overheads (kernel launches, synchronization, communication) get amortized over more samples. A toy model below.
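A toy throughput model of that amortization (my illustration, made-up constants, just for intuition):

```python
# Each step pays a fixed overhead regardless of batch size, plus a
# per-sample compute cost, so per-sample cost falls as bsize grows.
FIXED_OVERHEAD_S = 0.010  # per-step cost: launches, sync, comms
PER_SAMPLE_S = 0.001      # compute cost per sample

def throughput(bsize: int) -> float:
    """Samples processed per second at a given batch size."""
    return bsize / (FIXED_OVERHEAD_S + PER_SAMPLE_S * bsize)

for b in (1, 8, 64, 512):
    print(f"bsize={b:4d}: {throughput(b):7.1f} samples/s")
# bsize=1 -> ~91 samples/s; bsize=512 -> ~981 samples/s, approaching
# the 1/PER_SAMPLE_S = 1000 samples/s ceiling: closer to linear scaling.
```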
3/n But increasing bsize has a non-monotonic effect on convergence, i.e., the number of iterations to reach a given accuracy.
Note: this is an old figure (TensorFlow, boo gen Z), but the phenomenon was/is pretty universal across most supervised classification settings we tried.
I tried 14 of the multimodal reasoning examples from the @GoogleDeepMind Gemini paper on @OpenAI's ChatGPT-4 (with vision). I didn't even transcribe the prompts; I just pasted the images of the prompts.
GPT-4 gets ~12/14 right.
14-part boring thread.
Example 1: Verifying a student’s solution to a physics problem.
GPT-4 gets the same answer as Gemini.
Example 2: inverse graphics. GPT-4 is not quite there, but close; I'll give it 0.5 points for the effort and the bad JPEG it had to read.