A recent viral paper claims to reverse-engineer the parameter counts of frontier models: GPT-5.5 = 9.7T, Opus 4.7 = 4.0T, o1 = 3.5T, etc.
@ben_sturgeon and I investigated and found serious issues in the paper; fixing them puts GPT-5.5 at ~1.5T (90% CI: 256B-8.3T).
The paper, “Incompressible Knowledge Probes” (by @bojie_li), constructs a dataset of 1400 factual questions and fits model accuracy against log parameter count.
By inverting the fit, Li infers the parameter count of closed-source models from their dataset scores. arxiv.org/abs/2604.24827
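Mechanically, the method is just a straight-line fit plus an inversion. A minimal sketch of the idea, with made-up numbers (not Li's data or code):

```python
import numpy as np

# Hypothetical (params, IKP score) pairs for open models of known size --
# made-up numbers, not Li's data. The paper's claim is that score is
# roughly linear in log(params).
params_b = np.array([7.0, 70.0, 405.0, 671.0])  # billions of parameters
scores   = np.array([0.12, 0.31, 0.45, 0.49])   # IKP accuracy

slope, intercept = np.polyfit(np.log10(params_b), scores, 1)

def estimate_params_b(score: float) -> float:
    """Invert the fitted line to estimate a model's size from its score."""
    return 10 ** ((score - intercept) / slope)

print(f"{estimate_params_b(0.55):.0f}B")  # size implied by a score of 0.55
```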
@ben_sturgeon and I read the paper and reproduced the author’s results. We identified two serious methodological/data issues. The core idea behind the paper – the linear relationship between IKP score and log parameter count – survives, but the parameter count estimates do not.
Two issues impact the results.
Issue 1: the paper says scores aren't floored at 0 ("to preserve the bluff signal", §4.3), but the released code floors them, as do the numbers Li reports.
Correcting this reduces the scores of small models and halves the param/score regression slope.
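To see why the floor matters, here's a toy example with an illustrative bluff-penalty rubric (not the paper's exact scoring):

```python
import numpy as np

# Illustrative bluff-penalty scoring (not the paper's exact rubric):
# +1 correct, -1 for a confident wrong answer ("bluff"), 0 for "I don't know".
per_question = np.array([+1, -1, -1, 0, +1, -1, 0, 0])

unfloored = per_question.mean()                 # what §4.3 describes
floored   = np.maximum(per_question, 0).mean()  # what the released code does

print(unfloored, floored)  # -0.125 vs 0.25
```

Small models presumably bluff the most, which is why removing the floor hits their scores hardest.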
Issue 2: many hard questions are ambiguous or wrong.
We found that ~25% of researcher questions and >8% of Wikidata questions refer to ambiguous entities; others have ambiguous gold answers, and some gold answers are simply incorrect. We expect there are more we didn't catch.
After fixing both issues, the IKP-derived parameter estimates for frontier models generally drop, and the confidence intervals widen:
GPT-5.5: 9.7T -> 1.5T
Claude Opus 4.7: 4.0T -> 1.1T
DeepSeek R1 (true size 671B): 424B -> 760B
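On the widening CIs: with only a handful of open calibration models, uncertainty in the fitted line blows up once you invert it. A toy version of propagating that uncertainty (invented noise scale, not our exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical calibration set as above (made-up numbers).
log_p  = np.log10([7.0, 70.0, 405.0, 671.0])
scores = np.array([0.12, 0.31, 0.45, 0.49])

# Parametric bootstrap: jitter the scores, refit, re-invert. The noise
# scale 0.02 is invented for illustration, not estimated from real data.
draws = []
for _ in range(10_000):
    slope, intercept = np.polyfit(log_p, scores + rng.normal(0, 0.02, 4), 1)
    draws.append(10 ** ((0.55 - intercept) / slope))

lo, hi = np.percentile(draws, [5, 95])
print(f"90% CI: {lo:.0f}B - {hi:.0f}B")
```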
Interestingly, two suspected issues turned out not to matter: disabling vs enabling thinking affects estimated parameter counts much less after our fixes, and the various incorrect values in JSON files in the repository were never used to generate the paper’s numbers or figures.
That being said, three of Li's claims survived every stress test: 1) IKP score scales log-linearly with params; 2) Xiao et al.’s “densing law” doesn't apply to IKP score over time; 3) MoE total params predict knowledge better than active params (sketch below).
Just not his frontier parameter counts.
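Claim 3 is easy to check once you have scores: regress IKP score on log total params and on log active params for the MoE models and compare fit quality. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical MoE models: (total params B, active params B, IKP score).
# Made-up values for illustration, not the paper's data.
data = [(671.0, 37.0, 0.49), (141.0, 39.0, 0.36), (47.0, 13.0, 0.24)]
total, active, score = map(np.array, zip(*data))

def r2(x, y):
    """R^2 of a straight-line fit of y on x."""
    resid = y - np.polyval(np.polyfit(x, y, 1), x)
    return 1 - resid.var() / y.var()

print("log total params: ", r2(np.log10(total), score))
print("log active params:", r2(np.log10(active), score))
```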
The author claimed on Zhihu that this work was done by an AI agent in 4 days. It shows.
The website and codebase bear obvious hallmarks of careless vibe-coding: inconsistent definitions, silent failures, code that contradicts the paper text, etc.
We wrote up our investigation in detail, including a full summary of the IKP paper, our updated methodology and resulting parameter counts, and the precautions we took to ensure that our AI agents produced real results.
The amendment to the DoW-OAI deal may help, but I think it still fails to address key problems.
The core surveillance prohibition is limited to "intentional"/"deliberate" surveillance. If the DoW says the use is incidental, it's seemingly permitted, regardless of scale. 🧵
Why isn’t “intentional” enough? The DoW has long claimed that "incidentally" sweeping up Americans' data while targeting foreigners isn’t "intentional" domestic surveillance. And prior surveillance scandals often involved “incidental” collection. eff.org/pages/Incident…
See, e.g. the infamous testimony where DNI Clapper told Congress the NSA doesn’t collect data on millions of Americans: "Not wittingly. There are cases where they could, inadvertently perhaps, collect—but not wittingly." Snowden showed this was false.
OpenAI has released the language in their contract with the DoW, and it's exactly as Anthropic was claiming: "legalese that would allow those safeguards to be disregarded at will".
Note: the first paragraph doesn't say "no autonomous weapons"! It says "AI can't control autonomous weapons, unless existing law (which doesn't exist) or the DoW says otherwise."
Similarly, the mass surveillance use cases will "comply with existing law", but many forms of data collection that we'd consider "mass surveillance" are things that the NSA has consistently argued are legal under current law.
This, of course, did not stop OpenAI from blatantly misrepresenting this language in the blog post and in Sam Altman's tweets!
They also claim that because their model lives in their cloud and isn't an "edge deployment", it by definition can't count as a "fully autonomous weapon". Maybe that's _technically true_ in that the weapon needs to communicate with their servers, but it neither guarantees a human in the loop nor means their model won't make kill decisions on behalf of drones or missiles linked to it.
Besides o3, today OpenAI also published a “new paradigm” for alignment – “Deliberative Alignment” – which, if I’m reading the paper correctly, is Anthropic’s Constitutional AI approach straightforwardly applied to o1.
In the Deliberative Alignment paper, Guan et al. take a dataset of OAI policy-violating prompts, first use supervised finetuning to distill policy specifications into the generation model, then use a rating model with access to the policy specs to RLAIF the model.
Astute observers may notice the similarities to Anthropic’s Constitutional AI paper, where they … take a dataset of harmful prompts, use SFT to distill the constitution into the generation model via revisions, and then RLAIF the model using a rating model with access to the constitution.
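If you collapse both papers to pseudocode, the shared skeleton looks roughly like this (every helper name and signature here is my hypothetical stand-in, not either paper's actual code):

```python
from typing import Callable, List, Tuple

# Shared skeleton of both pipelines, as I read the two papers. Every
# helper below is a hypothetical stand-in, not either paper's actual API.

def align(
    generate: Callable[[str, str], str],     # (prompt, spec) -> completion
    rate: Callable[[str, str, str], float],  # (prompt, completion, spec) -> reward
    sft: Callable[[List[Tuple[str, str]]], None],
    rl_step: Callable[[str, List[str], List[float]], None],
    prompts: List[str],
    spec: str,  # the constitution (CAI) or policy specs (Deliberative Alignment)
) -> None:
    # Stage 1: distill the spec into the generator via SFT. CAI does this
    # with model-written revisions; Deliberative Alignment with
    # spec-citing chains of thought.
    sft([(p, generate(p, spec)) for p in prompts])

    # Stage 2: RLAIF. A rating model that can see the spec scores sampled
    # completions, and the generator trains on that reward.
    for p in prompts:
        samples = [generate(p, "") for _ in range(4)]  # spec already distilled in
        rewards = [rate(p, s, spec) for s in samples]
        rl_step(p, samples, rewards)
```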