May 30 • 5 tweets • 6 min read
Introducing LisanBench
LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward-planning, constraint adherence, memory and attention, and long context reasoning and "stamina".
"I see possible futures, all at once. Our enemies are all around us, and in so many futures they prevail. But I do see a way, there is a narrow way through." - Paul Atreides
How it works:
Models are given a starting English word and must generate the longest possible sequence of valid English words. Each subsequent word in the chain must:
- Differ from the previous word by exactly one letter (Levenshtein distance = 1)
- Be a valid English word
- Not repeat any previously used word
The benchmark repeats this process across multiple starting words of varying difficulty. A model's final score is the cumulative length of its longest valid chains from the starting words.
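These rules are simple enough to check mechanically. Here is a minimal validator sketch in Python (my own reconstruction, not the benchmark's actual code; the function names and the score-as-longest-valid-prefix convention are assumptions):

```python
# Minimal LisanBench-style chain validator (illustrative reconstruction).

def one_letter_apart(a: str, b: str) -> bool:
    """True iff the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # same length: exactly one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    # lengths differ by 1: exactly one insertion/deletion
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    i = j = diffs = 0
    while i < len(shorter) and j < len(longer):
        if shorter[i] == longer[j]:
            i += 1
            j += 1
        else:
            diffs += 1
            if diffs > 1:
                return False
            j += 1  # skip the extra letter in the longer word
    return True

def chain_score(chain: list[str], dictionary: set[str]) -> int:
    """Length of the valid prefix of the chain, starting from the given word."""
    seen = {chain[0]}
    for prev, cur in zip(chain, chain[1:]):
        if cur not in dictionary or cur in seen or not one_letter_apart(prev, cur):
            break  # first violation ends the chain
        seen.add(cur)
    return len(seen)

# toy example: cat -> cot -> cog -> dog is a valid chain of length 4
vocab = {"cat", "cot", "cog", "dog"}
print(chain_score(["cat", "cot", "cog", "dog"], vocab))  # 4
```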
Results:
- o3 is by far the best model, mainly because it is the only model that manages to escape from parts of the graph with very low connectivity and many dead-ends
(slight caveat: o3 was by far the most expensive one to run and used ~30-40k reasoning tokens per starting word)
- Opus 4 and Sonnet 4, with 16k reasoning tokens, also perform extremely well, especially Opus, which beat o3 on 3 starting words with only a third of the reasoning tokens!
- Claude 3.7 with thinking takes 4th place, ahead of o1
- the other OpenAI reasoning models all perform well, but size does make a difference! o1 is ahead of o4-mini high and o3-mini
- Gemini models perform a bit worse than their Anthropic and OpenAI counterparts, but they have by far the longest outputs - they are a bit delusional and keep yapping; they don't notice when they've made a mistake and stop
- strongest non-reasoning models: Grok-3, GPT-4.5, Sonnet 3.5 and 3.7, Opus 4, Sonnet 4, DeepSeek-V3 and Gemini 1.5 Pro
- Grok 3, Sonnet 3.5 and 3.7 are a surprise!!
Inspiration:
LisanBench draws from benchmarks like AidanBench and SOLO-Bench. However, unlike AidanBench, it's extremely cost-effective, trivially verifiable, and doesn't rely on an embedding model - the entire benchmark cost only ~$50 for 57 models.
And unlike SOLO-Bench, it explicitly tests knowledge and applies stronger constraints, which makes it more challenging!
Verification:
Verification uses the words_alpha.txt dictionary from github.com/dwyl/english-w… (~370,105 words), but for scalability, only words from the largest connected component (108,448 words) are used.
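A rough sketch of how that largest connected component could be extracted (my reconstruction, assuming words_alpha.txt has been downloaded; this is not the benchmark's actual code):

```python
# Build the implicit "one letter apart" word graph and keep its biggest component.

def neighbors(word: str, vocab: set[str]) -> set[str]:
    """All dictionary words at Levenshtein distance exactly 1 from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    cands = set()
    for i in range(len(word) + 1):                      # insertions
        cands.update(word[:i] + c + word[i:] for c in letters)
    for i in range(len(word)):
        cands.add(word[:i] + word[i + 1:])              # deletions
        cands.update(word[:i] + c + word[i + 1:] for c in letters)  # substitutions
    cands.discard(word)
    return cands & vocab

def largest_component(vocab: set[str]) -> set[str]:
    """Flood-fill the word graph and return its largest connected component."""
    unvisited, best = set(vocab), set()
    while unvisited:
        start = unvisited.pop()
        comp, stack = {start}, [start]
        while stack:
            for n in neighbors(stack.pop(), vocab):
                if n in unvisited:
                    unvisited.discard(n)
                    comp.add(n)
                    stack.append(n)
        if len(comp) > len(best):
            best = comp
    return best

# with open("words_alpha.txt") as f:
#     vocab = {w.strip() for w in f if w.strip()}
# core = largest_component(vocab)   # ~108k words according to the thread
```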
Easy Scaling, Difficulty Adjustment & Accuracy improvements:
- Scaling and Accuracy: Just add more starting words or increase the number of trials per word.
- Difficulty: Starting words vary widely - from those with 72 neighbors to those with just 1 - effectively distinguishing between moderately strong and elite models. Difficulty can also be gauged via local connectivity and branching factor.
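One rough way to gauge that difficulty, assuming the same dictionary as above, is simply to count how many words sit one edit away from the starting word (a hypothetical probe, not necessarily the exact metric used):

```python
def branching_factor(word: str, vocab: set[str]) -> int:
    """Number of dictionary words at Levenshtein distance exactly 1 from `word`."""
    def dist1(a: str, b: str) -> bool:
        if a == b or abs(len(a) - len(b)) > 1:
            return False
        if len(a) == len(b):
            return sum(x != y for x, y in zip(a, b)) == 1
        shorter, longer = (a, b) if len(a) < len(b) else (b, a)
        # one insertion/deletion: deleting some letter of the longer gives the shorter
        return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))
    return sum(dist1(word, w) for w in vocab)

# toy example: "cat" has 4 neighbors here (cot, car, cart, bat)
print(branching_factor("cat", {"cot", "car", "cart", "bat", "dog"}))  # 4
```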
Why is it challenging?
LisanBench uniquely stresses:
- Forward planning: avoiding dead ends by strategic word choices - models must find the narrow way through
- Knowledge: wide vocabulary is essential
- Memory and Attention: previously used words must not be repeated
- Precision: strict adherence to Levenshtein constraints
- Long-context reasoning: coherence and constraint-tracking over hundreds of steps
- Output stamina: some models break early during long generations — LisanBench exposes that, which is critical for agentic use cases
The two beautiful plots below show that the starting words are very different in difficulty. Some are in low-connectivity regions, some in high-connectivity regions, and others are just surrounded by dead-ends!
Just as Paul Atreides had to navigate the political, cultural, and metaphysical maze of his destiny, LLMs in LisanBench must explore vast word graphs, searching for the Golden Path - the longest viable chain without collapse.
We will know the chosen model when it appears.
It will be the one that finds the Golden Path and avoids every dead end. Right now, for the most difficult starting word "abysmal", the longest chain found is just 2, although it is also part of the >100k connected component. So there is a narrow way through!
More plots with full leaderboard below!
Full Leaderboard:
Feb 27 • 7 tweets • 2 min read
GPT-4.5 System Card
"Our largest and most knowledgeable model yet"
"scales pre-training further"
GPT-4.5 Research Engineer Interviews
Dec 16, 2024 • 5 tweets • 3 min read
Let's review OpenAI's 12 days of shipmas so far:
Day 1 - o1 and ChatGPT Pro:
- delivered a product they promised us months ago
- the launch was horrendous because of bad, missing and out-of-date benchmarks
- despite the failed launch, still no new benchmarks for o1 models
( - announcing o1 pro and ChatGPT Pro on this day was stupid imo, the Pro tier only makes sense once you know about Sora
- I love o1, but they should have done the price cuts on that day instead of o1 pro )
Overall Rating: horrendous presentation of good products and basically no surprise factor - 3.5/10
Day 2 - Reinforcement Fine-Tuning ALPHA:
- nice idea, could be very useful for businesses
- showed some cool applications
- just an alpha and completely useless for 95% of their users
- a surprise
Overall Rating: good presentation of good products but again just an alpha preview and limited applications - 6.5/10
Day 3 - Sora:
- very cool feature
- no surprise factor
- server issues
- extremely limited usage
- poor implementation, like no image preview (just read my post on why I think that; it could be so much better)
- europoors are cooked
- competitors offer the same
Overall Rating: very similar to the o1 launch - a cool product but very bad implementation and presentation - 3/10
Day 4 - Canvas:
- decent feature
- absolutely no wow or surprise factor
- I guess it can be used in CustomGPTs which is nice to have
- competitors offer the same
Overall Rating: honestly no remarks, very neutral - 5/10
Day 5 - ChatGPT Integration with Apple Intelligence:
- siri using ChatGPT to generate responses
- document analysis on macOS
- vision features for iPhone 16
- could have literally been a sidenote in the changelog
- RIP to all android poors
Overall Rating: at least some features but jesus ... - 2/10
Day 6 - Advanced Voice with Video:
- video understanding is a useful feature, no doubt about that, but the examples were so USELESS
- HOHOHO cringe santa
- but again no wow or surprise factor
- europoors are cooked once more
- competitors offer the same
Day 7 - Projects:
- organizing information is always good
- no wow or surprise factor
- competitors offer the same
Overall Rating: could've been a post on X - 4.5/10
Day 8 - Search:
- free users gain access to search - good for the poors
- search with AVM is nice
- in app maps
- search already existed before, so zero wow factor
- competitors offer the same
Overall Rating: 5.5/10
So far ~4.4/10 - Shipmas has been slightly underwhelming, unsurprising and unfortunately overshadowed by the botched launches of o1 and Sora
o1, o1-pro and Sora could've been solid 7s or 8s
AVM is a very cool and useful feature, but the presentation just lacked substance and good examples. I know you think it's all fun with the Christmas theme, but please just show me that this product is useful for my daily life!
Like help with homework, show how to jumpstart a car, hell even get the blind guy again... Just show anything more useful
Canvas, Search and Projects should've been in one presentation - alone they are not impressive, but as a whole they are a solid quality-of-life improvement.
RFT was so far the best and most surprising launch, but again no higher grade because it's just a preview alpha
Lastly, please don't ever make an Apple "launch" again
My Sora critique:
A few hours ago, Meta released the "Byte Latent Transformer" - a tokenizer-free architecture that dynamically encodes bytes into patches and achieves better inference efficiency and robustness!
(I was just talking about how we need dynamic tokenization that is learned during training 🥲
It's like fucking Christmas!)
I don't want to talk too much about the architecture.
But here's a nice visualization from their paper.
Let's look at benchmarks instead :)
"BLT models can match the performance of tokenization-based models like Llama 3 at scales up to 8B and 4T bytes, and can trade minor losses in evaluation metrics for up to 50% reductions in inference flops!"
This is basically a perplexity vs training flops chart - scaling laws with compute. BPB (bits per byte) is a tokenizer-independent version of perplexity.
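For anyone unfamiliar: bits-per-byte is the model's total cross-entropy expressed in bits and normalized by the number of raw bytes instead of tokens, so models with different tokenizers (or none at all) become directly comparable. A quick illustrative conversion (the numbers below are made up, not from the paper):

```python
import math

def bits_per_byte(avg_ce_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert tokenizer-dependent cross-entropy (nats/token) into bits-per-byte."""
    total_bits = n_tokens * avg_ce_nats_per_token / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# e.g. 2,048 tokens covering 8,192 raw bytes at 2.0 nats/token:
print(bits_per_byte(2.0, 2048, 8192))  # ~0.72 BPB
```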
BLT is on par with or better than Llama 3 with BPE!
Most importantly, they scale this approach to train a Llama-3 8B-scale model on 1T tokens, which beats the standard Llama-3 architecture with a BPE tokenizer!
Paper Link: ai.meta.com/research/publi…