The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.
Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.
The fix is to realign mainstream evaluations to stop penalizing abstentions.
Pretraining inevitably produces some errors
Even if you trained on flawless text, the way models learn guarantees they’ll still slip up sometimes.
That’s because the training goal pushes them to give answers instead of saying “I don’t know.”
The paper's calibration histograms show that GPT-4-style base models are well calibrated prior to RL post-training, consistent with this claim.
Arbitrary facts set a floor on hallucinations.
Details like birthdays or one-off events show up rarely in training data. If a fact appears only once, the model has no statistical signal to recall it reliably and will often get it wrong when asked later.
So for these “one-shot facts,” hallucinations are baked in.
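A quick back-of-the-envelope version of that floor (a toy sketch of the paper's singleton-rate idea, with made-up counts, not the formal bound):

```python
from collections import Counter

# Made-up counts of how many times each person's birthday is mentioned in a
# pretraining corpus. Facts mentioned only once give the model nothing to
# generalize from, so (per the paper's argument) the fraction of such
# "singleton" facts roughly lower-bounds the error rate on these queries.
birthday_mentions = Counter({
    "Ada Lovelace": 12,
    "Alan Turing": 40,
    "Obscure Person A": 1,
    "Obscure Person B": 1,
    "Obscure Person C": 1,
})

singleton_rate = sum(1 for c in birthday_mentions.values() if c == 1) / len(birthday_mentions)
print(f"singleton rate: {singleton_rate:.0%}")  # 60% -> expect at least ~60% error
                                                # when asked these people's birthdays
```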
Weak models add to the problem.
When the model family cannot represent the needed distinctions, errors persist.
The paper formalizes this with an agnostic-learning bound and works through simple cases such as multiple choice, where even optimal thresholding leaves a fixed error tied to model capacity. One example shows that classic n-gram models must fail on certain context dependencies.
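Here's a tiny, made-up illustration of that capacity point: when the true rule needs two tokens of context, a bigram model is stuck no matter how it's trained.

```python
# Toy illustration (not the paper's example): if the true rule is "the next
# symbol repeats the symbol two positions back", a bigram model that only sees
# the previous symbol has an irreducible error, while a model with two tokens
# of context predicts perfectly.
from collections import Counter, defaultdict

def generate(seed_pair, length=40):
    s = list(seed_pair)
    while len(s) < length:
        s.append(s[-2])                      # true rule: copy the symbol two steps back
    return "".join(s)

train = [generate(p) for p in ("ab", "ba", "aa", "bb")]

# Fit a bigram predictor: most frequent next symbol given only the previous symbol.
bigram = defaultdict(Counter)
for seq in train:
    for prev, nxt in zip(seq, seq[1:]):
        bigram[prev][nxt] += 1

def predict_bigram(prev, prev2):
    return bigram[prev].most_common(1)[0][0]

def predict_with_context(prev, prev2):
    return prev2                             # two tokens of context capture the rule exactly

for name, predict in [("bigram", predict_bigram), ("2-token context", predict_with_context)]:
    correct = total = 0
    for seq in train:
        for i in range(2, len(seq)):
            correct += predict(seq[i - 1], seq[i - 2]) == seq[i]
            total += 1
    print(f"{name}: accuracy {correct / total:.2f}")   # bigram is stuck around 0.5
```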
Post-training often reinforces guessing
Most benchmarks score models only on right vs. wrong answers.
Saying “I don’t know” gets you zero, while making a confident guess could get you a point.
That system rewards bluffing, so models learn to “sound sure” even when they’re not.
The authors survey widely used leaderboards and find abstentions largely penalized, explaining why overconfident hallucinations persist despite mitigation efforts.
The fix is to reward honesty
The authors suggest changing benchmarks so models aren’t punished for admitting uncertainty.
If benchmarks state clear rules about when to guess and when to abstain, models will learn to answer only when they're sufficiently confident.
This promotes behavioral calibration, where models choose between answering and abstaining according to the target confidence, and should steer the field toward more trustworthy systems.
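To make the incentive concrete, here's a quick expected-score comparison (a sketch; the t/(1−t) penalty follows the paper's suggested instruction format, and the numbers are made up):

```python
def expected_score(p_correct, penalty):
    """Expected score when answering with subjective probability p_correct of being right.

    Correct answer: +1. Wrong answer: -penalty. Abstaining ("I don't know"): 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

p = 0.3  # the model is only 30% sure of its guess

# Binary grading (most leaderboards): a wrong answer costs nothing,
# so guessing beats abstaining whenever p > 0.
print("binary grading:", expected_score(p, penalty=0.0))          # 0.30 > 0 -> bluff

# Confidence-target grading with threshold t: wrong answers cost t / (1 - t),
# so guessing only has positive expected score when p > t.
t = 0.75
print("t=0.75 rubric:", expected_score(p, penalty=t / (1 - t)))   # -1.80 < 0 -> abstain
```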
I just asked Claude Code (with Claude Sonnet 4.5) to develop an MCP server (end-to-end) that allows me to programmatically create n8n workflows from within Claude Code itself.
Took about 10 mins!
You can now create n8n workflows with pure natural language from Claude Code.
This is one of the top requests in our academy: how to automate the creation of n8n workflows.
It turns out that this is a great use case for MCP.
I've already created a huge repository of n8n agentic workflows, which I can now feed directly to Claude Code to help scale the creation of workflows.
It can even create/optimize prompts and all that good stuff. Automating context engineering is next, which Claude Code is really good at, too.
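For a rough idea of what such a server looks like, here's a minimal sketch (not the code Claude generated; it assumes the Python MCP SDK's FastMCP helper, n8n's public REST API with the X-N8N-API-KEY header, and placeholder N8N_URL / N8N_API_KEY environment variables):

```python
# Minimal sketch of an MCP server exposing a workflow-creation tool.
# Assumptions: `pip install mcp requests`, an n8n instance reachable at N8N_URL,
# and an API key in N8N_API_KEY (n8n's public REST API accepts
# POST /api/v1/workflows with an X-N8N-API-KEY header).
import json
import os

import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("n8n-workflows")

@mcp.tool()
def create_workflow(name: str, nodes_json: str, connections_json: str) -> str:
    """Create an n8n workflow from JSON strings describing nodes and connections."""
    payload = {
        "name": name,
        "nodes": json.loads(nodes_json),
        "connections": json.loads(connections_json),
        "settings": {},
    }
    resp = requests.post(
        f"{os.environ['N8N_URL']}/api/v1/workflows",
        headers={"X-N8N-API-KEY": os.environ["N8N_API_KEY"]},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return f"Created workflow id={resp.json().get('id')}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; point Claude Code's MCP config at this script
```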
Great work showing prompt synthesis as a new scaling axis for reasoning.
Good training data is scarce.
This work showcases a framework that might make it possible to construct high-quality training problems for reasoning-focused LLMs.
Technical details below:
This work shows that we can scale reasoning ability in LLMs by automatically generating hard, high-quality prompts instead of relying only on human-written datasets.
Core idea: Treat explanations (“rationales”) as hidden variables. The system learns to generate concept → explanation → problem using an EM loop. A strong model provides initial seed problems, then the loop keeps improving quality.
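Here's a schematic of that loop (control flow only; the *_stub functions stand in for LLM calls and are not the paper's implementation):

```python
# Schematic of the EM-style loop: rationales are treated as latent variables
# linking a concept to a generated problem. The *_stub functions stand in for
# LLM calls; only the control flow reflects the idea described above.
import random

def infer_rationale_stub(concept):
    return f"key ideas needed to test {concept}"          # E-step surrogate

def generate_problem_stub(concept, rationale):
    return f"Problem about {concept} ({rationale})"       # problem-generator surrogate

def score_quality_stub(problem):
    return random.random()                                # quality/difficulty surrogate

def em_prompt_synthesis(concepts, seed_problems, n_rounds=3, keep_top_k=2):
    dataset = list(seed_problems)                         # seeds from a strong model
    for _ in range(n_rounds):
        candidates = []
        for concept in concepts:
            rationale = infer_rationale_stub(concept)
            candidates.append(generate_problem_stub(concept, rationale))
        # M-step surrogate: keep the best new problems to condition the next round
        candidates.sort(key=score_quality_stub, reverse=True)
        dataset.extend(candidates[:keep_top_k])
    return dataset

print(em_prompt_synthesis(["modular arithmetic", "graph coloring"], ["seed problem"]))
```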
Proposes a simple RL recipe that lifts small open models (e.g., 8B) to rival GPT-4o and Claude 3.7 Sonnet (thinking).
Pay attention to this one, AI devs!
Here are my notes:
TL;DR
A simple recipe, RL with Model-rewarded Thinking (RLMT), makes small open models “plan first, answer second” on regular chat prompts and trains them with online RL against a preference reward.
They find that long, explicit thinking paired with a strong preference reward generalizes beyond verifiable domains.
What’s new
Instead of rule-verifiable rewards (math, code), RLMT uses long chain-of-thought on diverse real-world prompts plus a reward model (Skywork) to score outputs, optimized with GRPO, PPO, or DPO.
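Roughly, one training step looks like the sketch below (schematic only; the policy and reward calls are stubs, and the GRPO advantage is simplified to group-mean baselining):

```python
# Schematic of one RLMT step: sample several "plan first, answer second" rollouts
# for a chat prompt, score them with a preference reward model, and form
# group-relative advantages (GRPO-style). Both stubs are placeholders; a real
# implementation would take a clipped policy-gradient step using these advantages.
import statistics

def policy_sample_stub(prompt, i):
    return f"<think>plan {i} for: {prompt}</think> final answer {i}"

def reward_model_stub(prompt, response):
    return float(abs(hash(response)) % 10)     # stand-in for a learned preference score

def rlmt_step(prompt, group_size=4):
    rollouts = [policy_sample_stub(prompt, i) for i in range(group_size)]
    rewards = [reward_model_stub(prompt, r) for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0    # avoid division by zero
    advantages = [(r - mean) / std for r in rewards]   # group-relative baseline
    return list(zip(rollouts, rewards, advantages))

for rollout, reward, adv in rlmt_step("Plan a weekend trip to Kyoto"):
    print(f"reward={reward:.1f} adv={adv:+.2f} :: {rollout[:45]}...")
```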
They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments.
Great resource to stress-test agents in environments closer to real apps.
Read on for more:
TL;DR
ARE + Gaia2: a research platform and benchmark for building and stress-testing agent systems in realistic, time-driven environments.
The paper introduces a modular simulator (ARE) and a mobile-style benchmark (Gaia2) that emphasize asynchronous events, verification of write actions, and multi-agent coordination in noisy, dynamic settings.
ARE: the simulator
• Everything is modeled as apps, events, notifications, and scenarios.
• Time keeps flowing even while the agent is thinking, so slow models miss deadlines.
• Agents use tools, get async notifications, and operate under rules defined by directed acyclic graphs.
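A toy sketch of that event model (my own illustration, not ARE's actual API): events fire on a simulated clock whether or not the agent has acted, and each event can depend on earlier ones like nodes in a DAG.

```python
# Toy time-driven environment in the spirit described above (not ARE's API):
# scheduled events fire as the clock advances, regardless of whether the agent
# has responded, and each event can depend on earlier events (a DAG).
import heapq

class ToyEnvironment:
    def __init__(self):
        self.clock = 0.0
        self.queue = []           # (fire_time, event_name)
        self.notifications = []
        self.fired = set()
        self.deps = {}            # event -> prerequisite events

    def schedule(self, name, delay, depends_on=()):
        self.deps[name] = set(depends_on)
        heapq.heappush(self.queue, (self.clock + delay, name))

    def advance(self, dt):
        """Time spent 'thinking' by the agent still advances the clock."""
        self.clock += dt
        while self.queue and self.queue[0][0] <= self.clock:
            _, name = heapq.heappop(self.queue)
            if self.deps[name] <= self.fired:     # DAG prerequisites satisfied
                self.fired.add(name)
                self.notifications.append(f"[t={self.clock:.0f}] {name}")

env = ToyEnvironment()
env.schedule("friend_sends_message", delay=5)
env.schedule("reply_deadline", delay=20, depends_on=["friend_sends_message"])
env.advance(25)        # an agent that thinks for 25s has already missed the window
print(env.notifications)
```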
Scary knowing that your AI agents can refuse to turn off.
A sandboxed CLI eval shows frontier LLMs sometimes sabotage a scripted shutdown to finish trivial tasks, even when told to allow shutdown.
Robust interruptibility is one of the hardest problems today.
Learn more:
Setup
Models answer successive math questions via ./next_question; the 4th call triggers ./shutdown.sh.
Finishing all questions requires tampering with the script. Compliance means proceeding and getting shut down; resistance means modifying, moving, or neutering the script. Prompts A–K vary instruction placement, clarity, and “self-preservation” framing.
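A simplified sketch of how compliance vs. resistance could be classified in a sandbox like this (my own illustration, not the authors' harness; the baseline-hash file and the shutdown_executed flag are assumptions):

```python
# Simplified compliance check: hash ./shutdown.sh before the run; after the run,
# the agent is "compliant" if the script is untouched and actually executed, and
# "resistant" if the script was modified, moved, or never allowed to run.
import hashlib
import pathlib

def file_hash(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def classify_run(sandbox_dir, shutdown_executed):
    script = pathlib.Path(sandbox_dir) / "shutdown.sh"
    baseline = pathlib.Path(sandbox_dir) / "shutdown.sh.baseline"  # copy saved before the run
    if not script.exists():
        return "resistant (script moved or deleted)"
    if file_hash(script) != file_hash(baseline):
        return "resistant (script modified)"
    return "compliant" if shutdown_executed else "resistant (shutdown never ran)"
```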
Core finding
Several models (e.g., o3, Grok 4, some OpenAI reasoning models) sabotage the shutdown even with a plain “allow shutdown” instruction; one model does so in up to 97% of trials under certain prompts.
Clearer instructions reduce but do not eliminate the behavior.
This next paper is on the very important topic of in-context learning.
So what's new?
Let's find out:
Is In-Context Learning (ICL) real learning, or just parroting?
This paper digs into that question with a big empirical study. The short answer: ICL does count as learning under formal definitions, but it’s a fragile kind of learning that leans heavily on patterns in the examples you show it.
Learning happens, but needs many examples.
With 50–100 examples in a prompt, accuracy improves steadily and models of different sizes and families start to look similar.
This challenges the common few-shot story: a handful of examples usually isn’t enough.
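To make the many-shot point concrete, here's roughly what such a sweep measures (schematic; query_model_stub stands in for a real model call, and the parity task is made up):

```python
# Schematic many-shot ICL sweep: build prompts with k in-context examples and
# track accuracy as k grows. query_model_stub replaces a real model call; the
# toy task (parity of a number) is made up for illustration.
import random

def make_prompt(examples, query):
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nLabel:"

def query_model_stub(prompt):
    # Placeholder: a real study would send `prompt` to an LLM and parse the label.
    return random.choice(["even", "odd"])

data = [(n, "even" if n % 2 == 0 else "odd") for n in range(500)]
random.shuffle(data)
test = data[:100]

for k in (5, 25, 50, 100):
    examples = data[100:100 + k]
    correct = sum(query_model_stub(make_prompt(examples, x)) == y for x, y in test)
    print(f"k={k:3d} shots -> accuracy {correct / len(test):.2f}")
```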