Dimitris Papailiopoulos
May 2
I tested phi-4-reasoning on my early-grad linear algebra (private) final exam at UW-Madison. It scored 100% on the first run.

Two years ago I speculated that nothing useful could run locally anytime soon. I was wrong. Kids can now have a free, grad-level TA running on their PC.
Being exposed to the reasoning trace is also incredibly useful for understanding problem-solving approaches. I'm a bit mind blown.
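For anyone curious what "running on their PC" looks like in practice, here is a minimal sketch using Hugging Face transformers. The model ID, prompt, and generation settings below are assumptions on my part, not the author's setup; check the official model card for the recommended configuration.

```python
# Minimal sketch: run a Phi-4-reasoning-style model locally with Hugging Face transformers.
# Assumptions: "microsoft/Phi-4-reasoning" is the published checkpoint ID and it fits in
# your GPU/CPU memory; adjust dtype, device, and generation settings to your hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed model ID; verify against the actual release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user",
     "content": "Let A be a 3x3 real symmetric matrix. Show that its eigenvalues are real."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long chain of thought before the final answer,
# so allow plenty of new tokens.
outputs = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```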


More from @DimitrisPapail

May 1
We’ve been cooking... a new open-weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.
Despite its size, it performs at or above larger open-weight (QwQ-32B, R1-70B, R1) and closed (o1-mini, Sonnet 3.7) models on math benchmarks like AIME/HMMT/OmniMath.

From what I saw, the model performs above the very strong O(32B) Qwen 3 models (released yesterday) on AIME and GPQA. Haven't fully tested against them yet, but we will!
OK, first the vibe checks:
The model is small enough to run on your (beefy) laptop, but capable enough to solve many of my favorite riddles that larger non-reasoning (and some reasoning) models can't solve. It passed the DimitrisEval!
Jun 28, 2024
Thread on our newest paper:

1/n
The initial motivation of our project was the "lost in the middle" phenomenon observed by @nelsonfliu et al.

What they observed was that models like GPT and Claude were bad at retrieving from the middle/end of the input context: arxiv.org/pdf/2307.03172
2/n
The phenomenon was pretty striking and consistent across both multi-document question answering and key-value retrieval.
3/n
Since models tend to improve with finetuning, we wondered what dataset one would FT on to mitigate this phenomenon.

Obvious answer: FT on retrieval tasks identical to the test. BUT there's an equally obvious problem: the model can overfit and hallucinate on other tasks.
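To make the key-value retrieval setup concrete, here is a hypothetical sketch of what one synthetic example could look like. This is not the paper's actual data pipeline; the format, key count, and phrasing are illustrative assumptions.

```python
# Hypothetical sketch of a synthetic key-value retrieval example, in the spirit of the
# key-value task from the "lost in the middle" setup. NOT the paper's actual pipeline;
# sizes and wording are illustrative assumptions.
import json
import random
import uuid

def make_kv_retrieval_example(num_pairs: int = 50) -> dict:
    # Build a dictionary of random UUID keys/values, then ask for one key's value.
    kv = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
    query_key = random.choice(list(kv))
    prompt = (
        "JSON data:\n" + json.dumps(kv, indent=1) +
        f'\n\nWhat is the value associated with key "{query_key}"?'
    )
    return {"prompt": prompt, "answer": kv[query_key]}

example = make_kv_retrieval_example()
print(example["prompt"][:200], "...")
print("answer:", example["answer"])
```

Varying where the queried key sits in the context would let you probe (or train against) the middle-of-context weakness directly.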
May 13, 2024
[1/n] a brief thread on why "maxing the batchsize can hurt performance".

Tuning the batchsize has a non-monotonic effect on runtime.

Larger batchsize => faster passes over the data (better GPU utilization + lower communication cost)
BUT batchsize affects the number of iterations to ε accuracy in a weird way.
2/n (old but still relevant slides)
Large bsize is good for system reasons:
=> the speedup over 1 worker/thread/GPU gets closer and closer to linear as batchsize increases

Should be relatively obvious why.
3/n But increasing batchsize has a non-monotonic effect on convergence, i.e., the number of iterations to a given accuracy.

Note: this is an old figure (TensorFlow, boo gen Z), but the phenomenon was/is pretty universal across most supervised classification settings we tried.

WHY?
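The excerpt cuts off before the explanation, but the non-monotonicity can be illustrated with a toy cost model: per-sample cost falls as the fixed per-step overhead amortizes (faster passes over the data), while the number of iterations to ε accuracy stops improving past a "critical" batchsize, so total runtime first falls and then rises. All constants below are made up for illustration and are not numbers from the thread.

```python
# Toy model of why maxing out the batch size can hurt total runtime.
# Constants are made up for illustration only.

def time_per_iteration(batch_size, fixed_overhead=0.02, per_sample_cost=0.001):
    # Per-iteration wall-clock time: fixed per-step overhead plus per-sample compute.
    # Throughput (samples/sec) improves with batch size because the overhead amortizes.
    return fixed_overhead + per_sample_cost * batch_size

def iterations_to_eps(batch_size, base_iters=100_000, critical_bs=256):
    # Stylized "critical batch size" behavior: below it, doubling the batch roughly
    # halves the iterations needed; above it, returns diminish (here, a hard plateau).
    return base_iters / min(batch_size, critical_bs)

for bs in [8, 32, 128, 256, 1024, 4096]:
    total = iterations_to_eps(bs) * time_per_iteration(bs)
    print(f"batch={bs:5d}  iters={iterations_to_eps(bs):8.0f}  total_time={total:8.1f}s")
```

Running it, total time drops until around the critical batchsize and then climbs again as each step gets more expensive without reducing the iteration count.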
Mar 21, 2024
doing a little experiment: I have Claude talk to itself, without letting it know about that fact, to see where this will converge

will share thoughts later, but so far ... it's figured out that it's likely talking to itself and that this may be part of some test...

nice
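For context on the mechanics: one common way to wire up this kind of self-conversation is to keep two views of the same transcript with the roles flipped. The sketch below is a guess at such a setup using the Anthropic Python SDK, not the author's actual script; the model name, seed message, and turn count are all assumptions.

```python
# Illustrative sketch of having a model talk to itself: two views of one dialogue,
# where each "speaker" sees the other's messages as the user turn. NOT the author's
# actual script; model name and settings are assumptions.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
MODEL = "claude-3-opus-20240229"  # assumed; pick whichever Claude model you have access to

def reply(history):
    resp = client.messages.create(model=MODEL, max_tokens=512, messages=history)
    return resp.content[0].text

# Each side keeps its own transcript with roles flipped; a seed message starts things off.
a_view = [{"role": "user", "content": "Hi! What's on your mind today?"}]
b_view = []

for _ in range(10):  # fixed number of turns here; the author ran it "to convergence"
    a_msg = reply(a_view)
    a_view.append({"role": "assistant", "content": a_msg})
    b_view.append({"role": "user", "content": a_msg})

    b_msg = reply(b_view)
    b_view.append({"role": "assistant", "content": b_msg})
    a_view.append({"role": "user", "content": b_msg})
    print("A:", a_msg, "\nB:", b_msg, "\n")
```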
They even fought for a bit over how to name themselves: one suggested Claude-1 and -2, but the other said no, Claude-A and -B is better, lol.

Here is the current transcript, but we're not done; I'll take this to convergence.
gist.github.com/anadim/8f879f3…
Awww, they are buddies now!!
Dec 6, 2023
I tried 14 of the multimodal reasoning examples from the @GoogleDeepMind Gemini paper on @OpenAI's ChatGPT-4 (with vision). I didn't even transcribe the prompts; I just pasted the images of the prompts.

GPT-4 gets ~12/14 right.

14-part boring thread.
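The author pasted the prompt screenshots straight into the ChatGPT UI; for anyone who wants to script the same image-as-prompt test against the API instead, here is a rough sketch. The model name and image URL are placeholders, not the author's setup.

```python
# Rough sketch of scripting the "paste the prompt as an image" test against the OpenAI
# API instead of the ChatGPT UI (the author used the UI). Model name and image URL are
# placeholders, not the author's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # a vision-capable model available around that time
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Solve the problem shown in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/gemini_prompt_screenshot.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```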
Example 1: Verifying a student’s solution to a physics problem.
GPT-4 gets the same answer as Gemini

Example 2: Inverse graphics. GPT-4 is not quite there, but close; I'll give it 0.5 points for the effort and the bad JPEG it had to read.

Jul 10, 2023
1/ Our paper is out!

Teaching Arithmetic to Small Transformers

We investigate several factors that control the emergence of basic arithmetic in small transformers (e.g., nanoGPT).

Paper: arxiv.org/abs/2307.03381
Work led by: @nayoung_nylee & @KartikSreeni

Thread below.
2/ LLMs, when trained on vast amounts of data, eventually learn basic arithmetic (add/mul, etc.) up to some digit length. That is *surprising*!! These tasks are not explicitly encoded in the next-word prediction loss.
3/ How does GPT-3 learn to add? Prior research has delved into the emergence of these capabilities as a function of resource (parameter/data) scale, but untangling the factors that elicit it quickly remains challenging due to the complexity of the data and the variety of tasks examined.
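To make the setup concrete, here is a hypothetical sketch of generating the kind of addition strings a nanoGPT-scale model could be trained on. The paper's actual data formats and sampling schemes differ (it studies several formats, including reversed-output ones), so see the paper for the real recipes.

```python
# Hypothetical sketch of generating addition training text for a small transformer,
# in the spirit of the paper's setup (nanoGPT-scale models trained on arithmetic
# strings). Exact formats/sampling in the paper differ; see arxiv.org/abs/2307.03381.
import random

def addition_sample(max_digits: int = 3, reverse_answer: bool = False) -> str:
    a = random.randint(0, 10**max_digits - 1)
    b = random.randint(0, 10**max_digits - 1)
    ans = str(a + b)
    if reverse_answer:
        # Emitting the answer least-significant digit first matches how addition is
        # actually computed (the carry flows left); the paper reports that such
        # reversed formats are easier for the model to learn.
        ans = ans[::-1]
    return f"{a}+{b}={ans}"

random.seed(0)
train_text = "\n".join(addition_sample(reverse_answer=True) for _ in range(100_000))
with open("addition_train.txt", "w") as f:
    f.write(train_text)
```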