Lawrence Chan (@justanotherlaw)
May 2 · 10 tweets
A recent viral paper claims to reverse-engineer the parameter counts of frontier models: GPT-5.5 = 9.7T, Opus 4.7 = 4.0T, o1 = 3.5T, etc.

@ben_sturgeon and I investigated and found serious issues in the paper; fixing them gives GPT-5.5 as ~1.5T (90% CI: 256B-8.3T).
The paper, “Incompressible Knowledge Probes” (by @bojie_li), constructs a dataset of 1400 factual questions, and fits accuracy against parameter count.

By inverting the fit, Li infers the parameter count of closed-source models from their dataset scores.
arxiv.org/abs/2604.24827
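The paper's core move can be sketched in a few lines: fit IKP score against log parameter count on open-weight models, then invert the fit to size closed models. All numbers below are made up for illustration; the real scores come from the IKP dataset and the paper's repo.

```python
import numpy as np

# Hypothetical (params, score) pairs for open-weight models; real values
# come from the IKP dataset and the paper's released scores.
params_b = np.array([7, 13, 70, 405, 671])           # billions of parameters
scores   = np.array([0.12, 0.18, 0.31, 0.44, 0.48])  # IKP accuracy

# Fit: score ~= a * log10(params) + b
a, b = np.polyfit(np.log10(params_b), scores, 1)

def infer_params(score: float) -> float:
    """Invert the fit: estimate parameter count (in B) from an IKP score."""
    return 10 ** ((score - b) / a)
```

Any uncertainty in the fitted slope gets exponentiated on inversion, which is why the resulting parameter estimates carry such wide confidence intervals.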
@ben_sturgeon and I read the paper and reproduced the author’s results. We identified two serious methodological/data issues. The core idea behind the paper – the linear relationship between IKP score and log parameter count – survives, but the parameter count estimates do not.
Two issues impact the results.

Issue 1: the paper says scores aren't floored at 0, "to preserve the bluff signal" (§4.3). But the released code floors them (as do the numbers Li reports).

Removing the floor reduces the scores of small models and halves the param/score regression slope.
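A toy illustration (all numbers hypothetical) of why the floor matters: the bluff penalty pushes small-model raw scores below zero, and clipping them at 0 compresses the low end of the range, inflating the params-per-score slope that the inversion relies on.

```python
import numpy as np

# Hypothetical raw IKP scores: the bluff penalty drives small models negative.
raw     = np.array([-0.10, -0.03, 0.15, 0.35])
floored = np.maximum(raw, 0.0)           # what the released code computes
log_p   = np.log10([1, 7, 70, 400])      # parameter counts in billions

# Regress log10(params) on score -- the direction inverted to size models.
m_floored, _ = np.polyfit(floored, log_p, 1)
m_raw, _     = np.polyfit(raw, log_p, 1)
# Un-flooring spreads the scores out, so the params-per-score slope shrinks.
```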
Issue 2: many hard questions are ambiguous or wrong.

We found that ~25% of researcher questions and >8% of Wikidata questions refer to ambiguous entities. Others have ambiguous gold answers, and some gold answers are simply incorrect. We expect there are more problems we didn't catch.
After fixing both issues, the IKP-derived parameter estimates for frontier models generally drop, and the confidence intervals widen:

GPT-5.5: 9.7T -> 1.5T
Claude Opus 4.7: 4.0T -> 1.1T
DeepSeek R1 (true size 671B): 424B -> 760B
Interestingly, two further apparent issues turned out not to affect the results: disabling vs. enabling thinking changes the estimated parameter counts much less after our fixes, and the various incorrect values in JSONs in the repository were not actually used to generate the paper's numbers or figures.
That said, three of Li's claims survived every stress test:
1) IKP score scales log-linearly with parameter count
2) Xiao et al.'s "densing law" doesn't apply to IKP score over time
3) MoE total params predict knowledge better than active params

Just not his frontier parameter counts.
The author claimed on Zhihu that this work was done by an AI agent in 4 days. It shows.

The website and codebase bear obvious hallmarks of careless vibe-coding: inconsistent definitions, silent failures, code that contradicts the paper text, etc.

zhihu.com/pin/2032769685…
We wrote up our investigation in detail, including a full summary of the IKP paper, our updated methodology and resulting parameter counts, and the precautions we took to ensure that our AI agents produced real results.

You can find the full post here:
lesswrong.com/posts/veFMEzDD…


