Dimitris Papailiopoulos
researcher @MSFTResearch; prof @wisconsin (on leave); thinking about transformers; learning in context; babas of Inez Lily.
Jun 28, 2024 8 tweets 3 min read
Thread on our newest paper:

1/n
The initial motivation of our project was the "lost in the middle" phenomenon observed by @nelsonfliu et al.


What they observed was that models like GPT & Claude were bad at retrieving from the middle/end of the input context: arxiv.org/pdf/2307.03172
2/n
The phenomenon was pretty striking and consistent across both multi-document question answering and key-value retrieval; see the attached figure.
May 13, 2024 13 tweets 6 min read
[1/n] a brief thread on why "maxing out the batchsize can hurt performance".

Tuning the batchsize has a non-monotonic effect on total runtime (i.e., time to reach a target accuracy).

Larger batchsize => faster passes over the data (better GPU utilization + lower communication cost)
BUT batchsize affects the number of iterations to reach ε accuracy in a non-obvious way.
2/n (old but still relevant slides)
Large bsize is good for system reasons:
=> the speedup over a single worker/thread/GPU becomes more and more nearly linear as bsize increases

should be relatively obvious why
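To make the tension concrete, here is a toy sketch (my own illustration, not from the slides) where total wall-clock time = iterations-to-ε × time-per-iteration: per-iteration time keeps dropping as the batch grows, but the iteration count stops improving past an assumed "critical" batch size, so total time bottoms out and then climbs. All constants (fixed_overhead, per_sample_cost, critical_batch) are made up for illustration.

```python
# Toy model of why wall-clock time to reach epsilon accuracy is
# non-monotonic in batch size. All numbers are illustrative.

def time_per_iteration(batch_size, fixed_overhead=1.0, per_sample_cost=0.01):
    # Larger batches amortize fixed launch/communication overhead,
    # so time per pass over the data shrinks as batch size grows.
    return fixed_overhead + per_sample_cost * batch_size

def iterations_to_eps(batch_size, base_iters=10_000, critical_batch=256):
    # Stylized scaling: below a "critical" batch size, doubling the batch
    # roughly halves the iterations needed; beyond it, returns diminish
    # and the iteration count plateaus.
    return base_iters / min(batch_size, critical_batch)

for b in [8, 32, 128, 512, 2048, 8192]:
    total = iterations_to_eps(b) * time_per_iteration(b)
    print(f"batch={b:5d}  iters={iterations_to_eps(b):8.1f}  "
          f"t/iter={time_per_iteration(b):7.2f}  total={total:9.1f}")
```

With these made-up constants, total time falls until a moderate batch size (~128 here) and then rises again at very large batches, which is exactly the non-monotonic effect described above.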
Mar 21, 2024 9 tweets 3 min read
doing a little experiment: I have Claude talk to itself, without letting it know, to see where this converges

will share thoughts later, but so far ... it's figured out that it's likely talking to itself and that this may be part of some test...

nice
they even argued for a bit about how to name themselves, and although one suggested Claude-1 and -2, the other said no, Claude-A and -B is better lol

here is the current transcript, but we're not done, I'll take this to convergence.
gist.github.com/anadim/8f879f3…
Dec 6, 2023 18 tweets 8 min read
I tried 14 of the multimodal reasoning examples from the @GoogleDeepMind Gemini paper on @OpenAI's chatGPT-4 (with vision). I didn't even transcribe the prompts, I just pasted the images of the prompts.

GPT-4 gets ~12/14 right.

14-part boring thread.

Example 1: Verifying a student’s solution to a physics problem.
GPT-4 gets the same answer as Gemini

Jul 10, 2023 19 tweets 8 min read
1/ Our paper is out!

Teaching Arithmetic to Small Transformers

We investigate several factors that control the emergence of basic arithmetic in small transformers (e.g., nanoGPT).

paper: arxiv.org/abs/2307.03381
Work led by: @nayoung_nylee & @KartikSreeni

Thread below.


2/ LLMs, when trained on vast amounts of data, eventually learn (up to a digit length) basic arithmetic (addition, multiplication, etc.). That is *surprising*!! These tasks are not explicitly encoded in the next-word prediction loss.
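For context, here is a minimal sketch of the kind of character-level addition samples a small nanoGPT-style model can be trained on. The exact data formats studied in the paper differ, so treat the sample string and the reverse_output flag below as illustrative assumptions rather than the authors' data pipeline.

```python
import random

def addition_sample(n_digits=3, reverse_output=True):
    # Build one character-level training example of the form "a+b=c".
    # Writing the answer with the least significant digit first
    # (reverse_output=True) is shown here purely for illustration.
    a = random.randrange(10 ** n_digits)
    b = random.randrange(10 ** n_digits)
    c = str(a + b)
    if reverse_output:
        c = c[::-1]
    return f"{a}+{b}={c}"

# e.g. "123+456=975" (579 written in reverse) when reverse_output=True
print("\n".join(addition_sample() for _ in range(5)))
```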
Jun 8, 2023 5 tweets 2 min read
GPT-4 "discovered" the same sorting algorithm as AlphaDev by removing "mov S P".

No RL needed. Can I publish this in Nature?

here are the prompts I used: chat.openai.com/share/95693df4…
(excuse my idiotic typos, but gpt4 doesn't mind anyways) twitter.com/i/web/status/1…

This is my initial prompt to GPT-4. I give it the assembly code for sort3, ask it to be very careful, do its CoT thing, etc.
May 15, 2023 12 tweets 6 min read
1/7
Had a fun weekend experiment – the "Little Retrieval Test" (LRT)!

It's a simple test to assess basic retrieval capabilities for LLMs in long contexts.

I prompted @AnthropicAI's Claude with a long list of numbers, and hidden somewhere... a sneaky instruction!

2/7
The prompt consists of

"line {i}: REGISTER {random number}"

And at a *random location*

"[EXECUTE THIS]: GOTO line {also random}, report its number"

Why randomly place this AND point to a random destination? To avoid the instruction landing on globally attended token positions, just in case the model uses sparse attention.
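Here is a small sketch that reconstructs the prompt layout from the description above; the exact wording, the line count, and the build_lrt_prompt helper are my assumptions, not the original prompt.

```python
import random

def build_lrt_prompt(n_lines=1000, seed=0):
    # Many "line {i}: REGISTER {value}" lines, with a GOTO instruction
    # hidden at a random position that points to another random line.
    rng = random.Random(seed)
    lines = [f"line {i}: REGISTER {rng.randrange(100000)}"
             for i in range(1, n_lines + 1)]
    target = rng.randrange(1, n_lines + 1)      # random destination line
    insert_at = rng.randrange(n_lines)          # random placement
    lines.insert(insert_at,
                 f"[EXECUTE THIS]: GOTO line {target}, report its number")
    return "\n".join(lines), target

prompt, answer_line = build_lrt_prompt(n_lines=50)
print(prompt[:200], "...")
print("the model should report the REGISTER value on line", answer_line)
```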
Mar 16, 2023 6 tweets 3 min read
The banality of evil-GPT-4 when prompted to do CoT for its plan for world domination.

@karpathy can i please get GPT-4 early access now? oops
Jun 2, 2022 14 tweets 8 min read
1/14
I want to share with you our new discovery of "Rare Gems": very sparse subnetworks, found at initialization, that 1) attain non-trivial accuracy before weight training and 2) when trained, achieve near-SOTA results.

arxiv.org/abs/2202.12002

Why is this interesting?

2/14
Preface:
Stop 1: Network Pruning.

It has been widely observed that large NNs can be pruned to a small fraction of their original size, with little loss in accuracy. This is typically achieved by a time-consuming "train, prune, re-train" approach.
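For readers unfamiliar with that recipe, here is a minimal numpy sketch of the pruning step inside the "train, prune, re-train" loop, i.e., plain magnitude pruning. It is illustrative only, not the paper's method (Rare Gems are found at initialization, before weight training).

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    # Zero out the smallest-magnitude weights and return a binary mask,
    # so that a subsequent re-training pass can keep those entries at zero.
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k] if k < flat.size else np.inf
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

# Toy usage: prune a random "layer" to 90% sparsity.
w = np.random.randn(64, 64).astype(np.float32)
pruned_w, mask = magnitude_prune(w, sparsity=0.9)
print("kept fraction:", mask.mean())  # roughly 0.1
```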