elvis (@omarsar0) · Sep 6 · 7 tweets
Everyone is talking about this new OpenAI paper.

It's about why LLMs hallucinate.

You might want to bookmark this one.

Let's break down the technical details:
Quick Overview

The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.

Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.

The fix is to realign mainstream evaluations so they stop penalizing abstentions.
Pretraining inevitably produces some errors

Even if you trained on flawless text, the way models learn guarantees they’ll still slip up sometimes.

That’s because the training goal pushes them to give answers instead of saying “I don’t know.”

The paper's calibration histograms show that GPT-4-style base models are well calibrated prior to RL, consistent with this claim.
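Those histograms bucket predictions by confidence and compare each bucket's mean confidence to its accuracy. A generic sketch of that check (mine, not the paper's code), using toy data for a perfectly calibrated model:

    import numpy as np

    def calibration_buckets(confidences, correct, n_bins=10):
        """Compare mean confidence to accuracy within equal-width bins."""
        bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                print(f"bin {b}: conf={confidences[mask].mean():.2f} "
                      f"acc={correct[mask].mean():.2f} n={mask.sum()}")

    # Toy data: a well-calibrated model's accuracy tracks its confidence.
    rng = np.random.default_rng(0)
    conf = rng.uniform(0, 1, 10_000)
    corr = rng.uniform(0, 1, 10_000) < conf  # correct with prob = confidence
    calibration_buckets(conf, corr.astype(float))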
Arbitrary facts set a floor on hallucination rates.

Details like birthdays or one-off events show up rarely in training data. If a fact appears only once in training, the model has nothing to generalize from and is likely to guess wrong about it later.

So for these "one-shot facts," hallucinations are baked in.
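The paper makes this floor quantitative by relating the hallucination rate on such prompts to the fraction of training facts seen exactly once (the singleton rate). A toy sketch of that estimate, with made-up data:

    from collections import Counter

    # Toy corpus of atomic "facts" (hypothetical, for illustration only).
    facts = [("ada lovelace", "born 1815"), ("ada lovelace", "born 1815"),
             ("alan turing", "born 1912"),
             ("one-off event", "happened 1903")]

    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    singleton_rate = singletons / len(facts)  # Good-Turing-style estimate
    print(f"singleton rate = {singleton_rate:.2f}")
    # Per the paper's argument, the hallucination rate on such facts is
    # lower-bounded by roughly this quantity: facts seen once in training
    # cannot be reliably recalled, only guessed.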
Weak models add to the problem.

When the model family cannot represent the needed distinctions, errors persist.

The paper formalizes this via an agnostic-learning bound and gives simple cases, like multiple choice, where even optimal thresholding leaves a fixed error tied to model capacity. One example shows that classic n-gram models must fail on certain long-range context dependencies.
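For intuition on the capacity point, here is a toy example (mine, not the paper's) of a bigram model that cannot represent a dependency outside its one-word context window, so even the optimal bigram predictor errs half the time no matter how much data it sees:

    from collections import Counter, defaultdict

    # Training data where the right pronoun depends on the sentence's subject.
    corpus = [("alice", "likes", "her", "dog"),
              ("bob", "likes", "his", "dog")] * 100

    # Fit a bigram model: P(next word | previous word only).
    bigram = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            bigram[prev][nxt] += 1

    # After "likes" the model sees a 50/50 split; it cannot condition on
    # the subject two words back, so it must guess between "her" and "his".
    print(bigram["likes"])  # Counter({'her': 100, 'his': 100})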
Post-training often reinforces guessing

Most benchmarks score models only on right vs. wrong answers.

Saying “I don’t know” gets you zero, while making a confident guess could get you a point.

That system rewards bluffing, so models learn to “sound sure” even when they’re not.

The authors survey widely used leaderboards and find that abstentions are largely penalized, which explains why overconfident hallucinations persist despite mitigation efforts.
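The incentive is easy to quantify: under 0/1 grading, guessing weakly dominates abstaining whenever there is any chance of being right. A back-of-the-envelope sketch (not from the paper's code):

    def expected_score_binary(p_correct, abstain):
        """0/1 grading: right = 1, wrong = 0, "I don't know" = 0."""
        return 0.0 if abstain else p_correct

    for p in (0.1, 0.3, 0.5):
        print(f"p={p}: guess={expected_score_binary(p, False):.2f} "
              f"abstain={expected_score_binary(p, True):.2f}")
    # Guessing beats abstaining for every p > 0, so binary-graded
    # benchmarks train models to bluff rather than admit uncertainty.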
The fix is to reward honesty

The authors suggest changing benchmarks so models aren’t punished for admitting uncertainty.

If we add clear rules about when to guess and when to abstain, models will learn to only answer when they’re reasonably confident.

This promotes behavioral calibration, where models choose between answering and abstaining according to the target confidence, and should steer the field toward more trustworthy systems.
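One concrete scheme in this spirit, matching the explicit confidence targets the paper discusses (the numbers below are illustrative): announce a target t, score +1 for a correct answer, -t/(1-t) for a wrong one, and 0 for abstaining. Answering then has positive expected value exactly when confidence exceeds t:

    def expected_score(p_correct, t, abstain):
        """Right answer = +1, wrong = -t/(1-t), abstain ("IDK") = 0."""
        if abstain:
            return 0.0
        return p_correct - (1 - p_correct) * t / (1 - t)

    t = 0.75  # announced confidence target (illustrative)
    for p in (0.5, 0.75, 0.9):
        print(f"p={p}: answer={expected_score(p, t, False):+.2f} vs abstain=+0.00")
    # Expected value is positive exactly when p > t, so a score-maximizing
    # model abstains below the stated threshold: behavioral calibration.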

Paper:
cdn.openai.com/pdf/d04913be-3…

More from @omarsar0

Sep 6
Universal Deep Research

NVIDIA recently published another banger tech report!

The idea is simple: allow users to build their own custom, model-agnostic deep research agents with little effort.

Here is what you need to know:
Overview

Universal Deep Research (UDR) proposes a general, model-agnostic deep-research agent that lets users bring their own model and strategy.

Instead of a fixed pipeline, UDR compiles natural-language research strategies into executable code, runs them in a sandbox, and emits structured progress notifications before returning a final report.
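Everything in the sketch below is my hypothetical rendering of that described shape, not NVIDIA's code: a compiled strategy behaves like a generator that yields structured progress events before returning a final report.

    # Hypothetical sketch (my names, not NVIDIA's API): the compiled
    # strategy yields structured notifications, then returns a report.
    def compiled_strategy(query):
        yield {"phase": "search", "detail": f"searching for: {query}"}
        sources = ["paper A", "paper B"]           # stand-in for real search
        yield {"phase": "read", "detail": f"reading {len(sources)} sources"}
        notes = [f"summary of {s}" for s in sources]
        yield {"phase": "write", "detail": "drafting report"}
        return " | ".join(notes)                   # final report

    gen = compiled_strategy("LLM hallucination")
    while True:
        try:
            print("notification:", next(gen))
        except StopIteration as done:
            print("report:", done.value)
            break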
Motivation

Current deep-research tools hard-code strategy and model choice, limiting source prioritization, domain-specific workflows, and model swappability.

UDR targets all three gaps by separating the research strategy from the underlying model.
Sep 5
Cool research from Microsoft!

They release rStar2-Agent, a 14B math-reasoning model trained with agentic RL.

It reaches frontier-level math reasoning in just 510 RL training steps.

Here are my notes:
Quick Overview

rStar2-Agent (Microsoft Research). A 14B math-reasoning model trained with agentic RL that learns to think smarter by using a Python tool environment, not just longer CoT.

It introduces GRPO-RoC, a rollout strategy that filters noisy successful traces, plus infrastructure for massive, low-latency tool execution.
Method

GRPO-RoC oversamples rollouts, then keeps only the cleanest correct ones while preserving diverse failures, reducing tool-call errors and formatting issues during training.
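A hedged sketch of that resample-on-correct filtering: from an oversampled batch, retain the lowest-noise correct rollouts and keep failures for contrast. The error metric and the exact down-sampling rule here are my assumptions, not the paper's spec:

    def roc_filter(rollouts, keep):
        """Keep the `keep` correct rollouts with fewest tool/format errors,
        plus up to `keep` failures (assumption: simple truncation here;
        the paper keeps diverse failures)."""
        correct = [r for r in rollouts if r["reward"] > 0]
        wrong = [r for r in rollouts if r["reward"] <= 0]
        cleanest = sorted(correct, key=lambda r: r["errors"])[:keep]
        return cleanest + wrong[:keep]

    batch = [{"reward": 1, "errors": e} for e in (0, 3, 1)] + \
            [{"reward": 0, "errors": 2}] * 4
    print(roc_filter(batch, keep=2))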
Aug 31
Overview of Self-Evolving Agents

There is huge interest in moving from hand-crafted agentic systems to lifelong, adaptive agentic ecosystems.

What's the progress, and where are things headed?

Let's find out:
This survey defines self-evolving AI agents and argues for a shift from static, hand-crafted systems to lifelong, adaptive agentic ecosystems.

It maps the field’s trajectory, proposes “Three Laws” to keep evolution safe and useful, and organizes techniques across single-agent, multi-agent, and domain-specific settings.
Paradigm shift and guardrails

The paper frames four stages: Model Offline Pretraining → Model Online Adaptation → Multi-Agent Orchestration → Multi-Agent Self-Evolving.

It introduces three guiding laws for evolution: maintain safety, preserve or improve performance, and then autonomously optimize.
Aug 28
Memory-R1

Another really cool paper showing how RL can enhance an LLM's agentic and memory capabilities.

Great read for AI devs.

Here are my notes:
Overview

A framework that teaches LLM agents to decide what to remember and how to use it.

Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering.
Active memory control with RL

The Memory Manager selects ADD, UPDATE, DELETE, or NOOP after a RAG step and edits entries accordingly; training with PPO or GRPO uses downstream QA correctness as the reward, removing the need for per-edit labels.
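A hedged sketch of that control flow (the operation set comes from the paper; the store and the reward wiring below are my stand-ins):

    # Sketch: the Memory Manager picks one op per retrieved item; the only
    # training signal is whether the downstream answer was correct.
    OPS = ("ADD", "UPDATE", "DELETE", "NOOP")

    def apply_op(store, op, key, value=None):
        if op in ("ADD", "UPDATE"):
            store[key] = value
        elif op == "DELETE":
            store.pop(key, None)
        return store  # NOOP leaves the store unchanged

    store = {}
    apply_op(store, "ADD", "user_pet", "a dog named Rex")
    apply_op(store, "UPDATE", "user_pet", "two dogs")
    # RL reward (PPO/GRPO) = 1 if the Answer Agent, reading this store,
    # answers the QA probe correctly; no per-edit labels are needed.
    print(store)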
Aug 27
Don't sleep on small models!

Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively.

GPT-4.1-mini (for planning) and GPT-4o (for worker agents) surpass the strongest open-source baseline on GAIA.

A must-read for devs:
Quick Overview

Anemoi is a semi-centralized generalist multi-agent system powered by an A2A communication MCP server from @Coral_Protocol.

Anemoi replaces purely centralized, context-stuffed coordination with an A2A communication server (MCP) that lets agents talk directly, monitor progress, refine plans, and reach consensus.
Design

A semi-centralized planner proposes an initial plan, while worker agents (web, document processing, reasoning/coding) plus critique and answer-finding agents collaborate via MCP threads.

Agents communicate directly with each other.

All participants can list agents, create threads, send messages, wait for mentions, and update plans as execution unfolds.
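To make those primitives concrete, here is a toy, self-contained stand-in (explicitly not Coral Protocol's actual API) for the thread operations listed above:

    # Toy stand-in for the A2A thread primitives (NOT Coral Protocol's API).
    class ThreadServer:
        def __init__(self, agents):
            self.agents = agents      # supports "list agents"
            self.threads = {}

        def create_thread(self, name):
            self.threads[name] = []

        def send(self, thread, sender, text, mentions=()):
            self.threads[thread].append((sender, text, tuple(mentions)))

        def mentions_for(self, thread, agent):
            # "wait for mentions", reduced to a synchronous lookup
            return [m for m in self.threads[thread] if agent in m[2]]

    srv = ThreadServer(["planner", "web", "critic"])
    srv.create_thread("gaia-task-1")
    srv.send("gaia-task-1", "planner", "draft plan: search X", mentions=["web"])
    print(srv.mentions_for("gaia-task-1", "web"))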
Aug 27
Efficient Language Model with PostNAS

NVIDIA's recent research on LLMs has been fantastic.

Jet-Nemotron is the latest in efficient language models, which significantly improves generation throughput.

Here are my notes:
A hybrid-architecture LM family built by “adapting after pretraining.”

Starting from a frozen full-attention model, the authors search where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits.

The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
PostNAS pipeline

PostNAS begins with a pre-trained full-attention model, freezes its MLPs, and then proceeds in four steps:

1. Learn optimal placement or removal of full-attention layers
2. Select a linear-attention block
3. Design a new attention block
4. Run a hardware-aware hyperparameter search (sketched below)
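Step 4 is essentially constrained search: pick the best-scoring configuration whose measured throughput fits the hardware budget. A toy sketch with made-up numbers (not the paper's search code):

    # Toy hardware-aware search (step 4); all numbers are invented.
    candidates = [
        {"heads": 8,  "dim": 64, "accuracy": 0.61, "tok_per_s": 4200},
        {"heads": 12, "dim": 64, "accuracy": 0.63, "tok_per_s": 2900},
        {"heads": 8,  "dim": 96, "accuracy": 0.64, "tok_per_s": 1800},
    ]
    budget = 2500  # minimum acceptable decode throughput (tokens/sec)

    feasible = [c for c in candidates if c["tok_per_s"] >= budget]
    best = max(feasible, key=lambda c: c["accuracy"])
    print(best)  # highest accuracy among configs that meet the budget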
