The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.
Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.
The fix is to realign mainstream evaluations to stop penalizing abstentions.
Pretraining inevitably produces some errors
Even if you trained on flawless text, the way models learn guarantees they’ll still slip up sometimes.
That’s because the training goal pushes them to give answers instead of saying “I don’t know.”
The paper backs this with calibration histograms showing that GPT-4-style base models are well calibrated before RL post-training, consistent with this claim.
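For readers who want the idea concrete: a calibration histogram buckets a model's answers by stated confidence and checks that accuracy in each bucket roughly matches it. A toy sketch of that computation, with all names my own rather than the paper's:

```python
# Toy reliability-diagram computation over (confidence, was_correct) pairs.
import numpy as np

def calibration_bins(confidences, correct, n_bins=10):
    """Bucket predictions by confidence; well calibrated means the mean
    confidence in each bin is close to the empirical accuracy there."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            rows.append((lo, hi, confidences[mask].mean(),
                         correct[mask].mean(), int(mask.sum())))
    return rows

# Example: three answers given at ~0.9 confidence, two of them correct.
for lo, hi, conf, acc, n in calibration_bins([0.92, 0.88, 0.91], [1, 1, 0]):
    print(f"[{lo:.1f}, {hi:.1f}): confidence={conf:.2f} accuracy={acc:.2f} n={n}")
```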
Arbitrary facts set a floor on hallucinations.
Details like birthdays or one-off events show up rarely in training data. If a fact appears only once, the model has essentially nothing to generalize from and is liable to guess wrong about it later.
So for these “one-shot facts,” hallucinations are baked in.
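A quick way to see the "one-shot fact" floor: count the share of facts that occur exactly once. The paper argues (roughly) that the hallucination rate on such facts is lower-bounded by this singleton rate; the sketch below just computes it, and the fact strings are illustrative:

```python
# Toy estimate of the "singleton rate": the share of fact occurrences whose
# fact appears exactly once in the corpus. Illustrative, not the paper's code.
from collections import Counter

def singleton_rate(facts):
    counts = Counter(facts)
    singletons = sum(1 for fact, n in counts.items() if n == 1)
    return singletons / len(facts)

facts = ["A born 1990", "B born 1985", "A born 1990", "C born 1972"]
print(singleton_rate(facts))  # 0.5: two of four occurrences are one-off facts
```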
Weak models add to the problem.
When the model family cannot represent the needed distinctions, errors persist.
The paper formalizes this with an agnostic-learning bound and works through simple cases like multiple choice, where even optimal thresholding leaves a fixed error tied to model capacity. One example shows that classic n-gram models must fail on context dependencies longer than their window.
Post-training often reinforces guessing
Most benchmarks score models only on right vs. wrong answers.
Saying “I don’t know” gets you zero, while making a confident guess could get you a point.
That system rewards bluffing, so models learn to “sound sure” even when they’re not.
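The arithmetic behind that incentive is worth spelling out: under right-or-wrong grading, answering with confidence p has expected score p, while abstaining scores 0, so guessing always at least ties. A tiny illustration (my own, not from the paper):

```python
# Under binary 0/1 grading, a guess at any confidence beats abstaining.
def expected_score_binary(p_correct, abstain=False):
    return 0.0 if abstain else p_correct  # "I don't know" earns nothing

for p in (0.1, 0.5, 0.9):
    print(f"p={p}: guess -> {expected_score_binary(p):.1f}, abstain -> 0.0")
```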
The authors survey widely used leaderboards and find abstentions largely penalized, explaining why overconfident hallucinations persist despite mitigation efforts.
The fix is to reward honesty
The authors suggest changing benchmarks so models aren’t punished for admitting uncertainty.
If benchmarks state clear rules about when to guess and when to abstain, models can learn to answer only when they're sufficiently confident.
This promotes behavioral calibration, where models choose between answering and abstaining according to the target confidence, and should steer the field toward more trustworthy systems.
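As I read the paper's proposal (hedging the exact details), one concrete scheme is an explicit confidence target t: +1 for a correct answer, −t/(1−t) for a wrong one, 0 for abstaining. Then answering has positive expected value exactly when confidence exceeds t:

```python
# Confidence-target scoring: +1 if right, -t/(1-t) if wrong, 0 to abstain.
# Expected value of answering at confidence p is p - (1-p) * t/(1-t),
# which is zero at p = t, so a rational model abstains whenever p < t.
def expected_score(p, t):
    penalty = t / (1.0 - t)
    return p - (1.0 - p) * penalty

t = 0.75
for p in (0.5, 0.75, 0.9):
    print(f"p={p}: answer -> {expected_score(p, t):+.2f}, abstain -> 0.00")
```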
The spec-init slash command prompt, if you want to try it:
"Your task is to first help me build a spec for my new project in ARGUMENT.
Use the AskUserQuestion Tool to help build the spec in ARGUMENT by interviewing me and gathering requirements and details about the project implementation, UI & UX, tech stack, concerns, tradeoffs, etc.
Make sure questions are not obvious and probe deeper into the underlying needs and constraints.
Interview me continually and systematically until the spec is complete. Document all responses and insights to create a comprehensive and well-structured specification that serves as the foundation for the project."
Just built a new skill in Claude Code using Opus 4.5.
The skill uses Gemini 3 Pro (via API) for designing web pages.
Look at what it generated from one simple prompt.
If you have been designing websites with Claude Code, you already know how generic they turn out.
So I built a skill that uses Gemini 3 Pro to lead creative direction and generate designs. It is extremely good at this.
Opus 4.5 then integrates all that into our app.
The prompt I used: "I want to design the landing page for a new AI game. We want it to be futuristic and all that, and use animations as much as possible."
I will test with some other prompts and see how far I can push this. But the results are very exciting already.
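For the curious, here is a minimal sketch of what the skill's design-direction call might look like, using the google-genai Python SDK; the model id and prompt wiring are my assumptions, not the author's actual skill code:

```python
# Hypothetical sketch: ask Gemini for creative direction, then hand the spec
# to the coding model. The model id below is an assumption.
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

brief = "Landing page for a new AI game: futuristic, animation-heavy."
response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed identifier for Gemini 3 Pro
    contents=(
        "You are the creative director. Produce a detailed visual design spec "
        f"(layout, palette, typography, motion) for: {brief}"
    ),
)
design_spec = response.text  # Opus 4.5 in Claude Code would implement this
print(design_spec)
```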
This is one of the most insane things Nano Banana Pro 🍌 can do.
It can reproduce figures with mind-blowing precision.
No competition in this regard!
Prompt: "Please reproduce this chart in high quality and fidelity and offer annotated labels to better understand it."
When I tried this for the first time, I didn't expect that this was possible.
The level of visual understanding this requires is the remarkable part.
The levels of personalization this unlocks are also impressive.
"Can you convert it into a cartoonish version?"
Just look at this 🤯
"Can you create a delightful cartoonish version of this table. And please put cute colors and icons along with interesting annotations to make it more readable."