The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.
Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.
The fix is to realign mainstream evaluations to stop penalizing abstentions.
Pretraining inevitably produces some errors
Even if you trained on flawless text, the way models learn guarantees they’ll still slip up sometimes.
That’s because the training goal pushes them to give answers instead of saying “I don’t know.”
Calibration histograms in the paper show that GPT-4-style base models are well calibrated before RL post-training, consistent with this claim.
Arbitrary facts set a floor on hallucinations.
Details like birthdays or one-off events show up rarely in training data; when a fact appears only once, the model has essentially no signal to recall it later and is likely to guess wrong.
So for these “one-shot facts,” hallucinations are baked in.
Weak models add to the problem.
When the model family cannot represent the needed distinctions, errors persist.
The paper formalizes this via an agnostic-learning bound and gives simple cases like multiple choice, where even optimal thresholding leaves a fixed error tied to model capacity, with an example showing classic n-gram models must fail on certain context dependencies.
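To make the one-shot-fact intuition concrete, here is a small illustration of the singleton rate, the fraction of facts seen exactly once in training, which the paper ties (in a more careful bound) to the minimum hallucination rate on arbitrary facts. The toy corpus and helper below are my own sketch, not code from the paper.

```python
from collections import Counter

def singleton_rate(fact_occurrences):
    """Fraction of distinct facts that appear exactly once in training.

    Informal reading of the paper's argument: for arbitrary facts with no
    learnable pattern, facts seen only once give the model nothing to
    generalize from, so this rate behaves like a floor on hallucinations.
    """
    counts = Counter(fact_occurrences)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus of (entity, attribute) occurrences.
facts = [
    ("Ada Lovelace", "born 1815"),
    ("Ada Lovelace", "born 1815"),      # repeated -> learnable
    ("obscure person A", "born 1993"),  # singleton -> likely guessed later
    ("obscure person B", "born 1972"),  # singleton -> likely guessed later
]
print(f"singleton rate: {singleton_rate(facts):.2f}")  # 2 of 3 distinct facts = 0.67
```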
Post-training often reinforces guessing
Most benchmarks score models only on right vs. wrong answers.
Saying “I don’t know” gets you zero, while making a confident guess could get you a point.
That system rewards bluffing, so models learn to “sound sure” even when they’re not.
The authors survey widely used leaderboards and find abstentions largely penalized, explaining why overconfident hallucinations persist despite mitigation efforts.
The fix is to reward honesty
The authors suggest changing benchmarks so models aren’t punished for admitting uncertainty.
If benchmarks state explicit rules about when to guess and when to abstain (for instance, a confidence target with a penalty for wrong answers), models will learn to answer only when they're sufficiently confident.
This promotes behavioral calibration, where models choose between answering and abstaining according to the target confidence, and should steer the field toward more trustworthy systems.
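To see the incentive in numbers, here is a small sketch (my own illustration, in the spirit of the paper's suggestion of explicit confidence targets) comparing the expected score of a low-confidence guess under binary grading versus a rule that penalizes wrong answers relative to a confidence threshold t:

```python
def expected_score_binary(p_correct, abstain):
    """Standard benchmark: 1 point if right, 0 if wrong or abstaining."""
    return 0.0 if abstain else p_correct

def expected_score_thresholded(p_correct, abstain, t=0.75):
    """Illustrative rule: abstentions score 0, wrong answers cost t/(1-t) points.

    Answering has positive expected value only when confidence exceeds t,
    so honest abstention is the optimal move below the threshold.
    """
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

p = 0.30  # model is only 30% sure
print(expected_score_binary(p, abstain=False))       # 0.30 -> guessing always "pays"
print(expected_score_thresholded(p, abstain=False))  # -1.80 -> better to abstain
print(expected_score_thresholded(p, abstain=True))   # 0.0
```

Under the thresholded rule, answering only beats abstaining when the model's confidence exceeds t, which is exactly the behavioral calibration described above.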
NVIDIA recently published another banger tech report!
The idea is simple: allow users to build their own custom, model-agnostic deep research agents with little effort.
Here is what you need to know:
Overview
Universal Deep Research (UDR) proposes a general, model-agnostic deep-research agent that lets users bring their own model and strategy.
Instead of a fixed pipeline, UDR compiles natural-language research strategies into executable code, runs them in a sandbox, and emits structured progress notifications before returning a final report.
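As a rough mental model of that flow (purely illustrative; compile_strategy and run_sandboxed are placeholder names, not UDR's actual API), a compiled strategy can be thought of as a routine that yields structured progress notifications before a final report:

```python
# Hypothetical sketch of the UDR execution loop: the real system compiles a
# natural-language strategy into code and runs it in a sandbox; the names
# below are placeholders, not UDR's API.

def compile_strategy(strategy_text: str):
    """Pretend 'compiler': returns a generator-based research routine."""
    def routine(llm, search):
        yield {"type": "progress", "msg": "searching sources"}
        docs = search("topic derived from the strategy")
        yield {"type": "progress", "msg": f"summarizing {len(docs)} documents"}
        report = llm(f"Summarize: {docs}")
        yield {"type": "final_report", "msg": report}
    return routine

def run_sandboxed(routine, llm, search):
    for event in routine(llm, search):  # structured notifications stream out
        print(event["type"], "-", event["msg"])

run_sandboxed(
    compile_strategy("survey recent work on X, then write a brief"),
    llm=lambda prompt: "stub report",
    search=lambda q: ["doc1", "doc2"],
)
```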
Motivation
Current deep-research tools hard-code strategy and model choice, limiting source prioritization, domain-specific workflows, and model swappability.
UDR targets all three gaps by separating the research strategy from the underlying model.
Microsoft Research releases rStar2-Agent, a 14B math reasoning model trained with agentic RL.
It reaches frontier-level math reasoning in just 510 RL training steps.
Here are my notes:
Quick Overview
rStar2-Agent (Microsoft Research). A 14B math-reasoning model trained with agentic RL that learns to think smarter by using a Python tool environment, not just longer CoT.
It introduces GRPO-RoC, a rollout strategy that filters noisy successful traces, plus infrastructure for massive, low-latency tool execution.
Method
GRPO-RoC oversamples rollouts, then keeps only the cleanest correct ones while preserving diverse failures, reducing tool-call errors and formatting issues during training.
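Here is a minimal sketch of that filtering step as I understand it; the cleanliness score and keep counts are assumptions for illustration, not the paper's implementation:

```python
import random

def filter_rollouts_roc(rollouts, keep_correct=4, keep_incorrect=4):
    """Resample-on-Correct style filtering (illustrative, not the paper's code).

    rollouts: list of dicts with keys
      'correct' (bool), 'tool_errors' (int), 'format_errors' (int), 'trace' (str)
    Correct rollouts are ranked so the cleanest ones (fewest tool/format
    errors) are kept; incorrect rollouts are sampled without filtering to
    preserve diverse failure signal for the RL update.
    """
    correct = [r for r in rollouts if r["correct"]]
    incorrect = [r for r in rollouts if not r["correct"]]

    # Keep the cleanest successful traces.
    correct.sort(key=lambda r: r["tool_errors"] + r["format_errors"])
    kept_correct = correct[:keep_correct]

    # Keep a random, unfiltered sample of failures to retain diversity.
    kept_incorrect = random.sample(incorrect, min(keep_incorrect, len(incorrect)))

    return kept_correct + kept_incorrect
```

The asymmetry is the point: successes are filtered for quality so the policy imitates clean tool use, while failures are kept unfiltered so the updates still see diverse mistakes.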
There is a huge interest in moving from hand-crafted agentic systems to lifelong, adaptive agentic ecosystems.
What's the progress, and where are things headed?
Let's find out:
This survey defines self-evolving AI agents and argues for a shift from static, hand-crafted systems to lifelong, adaptive agentic ecosystems.
It maps the field’s trajectory, proposes “Three Laws” to keep evolution safe and useful, and organizes techniques across single-agent, multi-agent, and domain-specific settings.
Paradigm shift and guardrails
The paper frames four stages: Model Offline Pretraining → Model Online Adaptation → Multi-Agent Orchestration → Multi-Agent Self-Evolving.
It introduces three guiding laws for evolution: maintain safety, preserve or improve performance, and then autonomously optimize.
Another really cool paper showing how RL can enhance an LLM's agentic and memory capabilities.
Great read for AI devs.
Here are my notes:
Overview
A framework that teaches LLM agents to decide what to remember and how to use it.
Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering.
Active memory control with RL
The Memory Manager selects ADD, UPDATE, DELETE, or NOOP after a RAG step and edits entries accordingly; training with PPO or GRPO uses downstream QA correctness as the reward, removing the need for per-edit labels.
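A hedged sketch of what one Memory Manager step plus its reward could look like; the store layout, policy signature, and reward wiring below are my assumptions, not the framework's actual code:

```python
# Illustrative Memory Manager step: the policy picks a CRUD-style operation
# after retrieval, and the only training signal is whether the downstream
# Answer Agent gets the question right (no per-edit labels).
OPS = ["ADD", "UPDATE", "DELETE", "NOOP"]

def memory_manager_step(policy, memory_store, retrieved, new_info):
    op, target_id, payload = policy(retrieved, new_info)  # e.g. ("UPDATE", 3, "...")
    if op == "ADD":
        memory_store.append(payload)
    elif op == "UPDATE" and target_id is not None:
        memory_store[target_id] = payload
    elif op == "DELETE" and target_id is not None:
        memory_store.pop(target_id)
    # NOOP: leave the store untouched
    return op

def reward_from_answer(answer_agent, question, memory_store, gold):
    # Downstream QA correctness is the reward fed to PPO/GRPO.
    prediction = answer_agent(question, memory_store)
    return 1.0 if prediction == gold else 0.0
```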
Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively.
GPT-4.1-mini (for planning) and GPT-4o (for worker agents) surpass the strongest open-source baseline on GAIA.
A must-read for devs:
Quick Overview
Anemoi is a semi-centralized generalist multi-agent system powered by an A2A communication MCP server from @Coral_Protocol.
It replaces purely centralized, context-stuffed coordination with direct agent-to-agent communication, letting agents talk to each other, monitor progress, refine plans, and reach consensus.
Design
A semi-centralized planner proposes an initial plan, while worker agents (web, document processing, reasoning/coding) plus critique and answer-finding agents collaborate via MCP threads.
Agents communicate directly with each other.
All participants can list agents, create threads, send messages, wait for mentions, and update plans as execution unfolds.
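For intuition, here is a hypothetical agent-side view of those primitives; the class and function names below only mirror the capabilities listed above and are not Coral Protocol's real SDK:

```python
# Hypothetical agent-side view of the A2A MCP primitives described above.
# None of these names come from Coral Protocol's actual SDK.

class A2AClient:
    def list_agents(self): ...
    def create_thread(self, title, participants): ...
    def send_message(self, thread_id, text, mentions=()): ...
    def wait_for_mentions(self, agent_name): ...
    def update_plan(self, thread_id, plan): ...

def worker_loop(client: A2AClient, me: str):
    """A worker agent blocks until mentioned, does its subtask, replies."""
    while True:
        msg = client.wait_for_mentions(me)
        if msg is None:
            break
        result = do_subtask(msg["text"])  # web search, doc parsing, coding, ...
        client.send_message(msg["thread_id"], result, mentions=["planner"])

def do_subtask(text: str) -> str:
    return f"result for: {text}"
```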
NVIDIA's recent research on LLMs has been fantastic.
Jet-Nemotron is their latest efficient language model family, and it significantly improves generation throughput.
Here are my notes:
A hybrid-architecture LM family built by “adapting after pretraining.”
Starting from a frozen full-attention model, the authors search where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits.
The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
PostNAS pipeline
Begins with a pre-trained full-attention model and freezes MLPs, then proceeds in four steps:
1. Learn optimal placement or removal of full-attention layers
2. Select a linear-attention block
3. Design a new attention block
4. Run a hardware-aware hyperparameter search
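A toy sketch of the hardware-aware search at the end of that pipeline; the candidate space, throughput check, and scoring below are placeholders, not the actual PostNAS procedure:

```python
import itertools

def postnas_style_search(candidate_layers, linear_blocks, build_model,
                         measure_throughput, eval_accuracy, min_tok_per_s):
    """Toy hardware-aware search over which layers keep full attention and
    which linear-attention block to use elsewhere (MLPs stay frozen).
    Returns the most accurate configuration that meets the throughput budget.
    """
    best = None
    for keep_full, block in itertools.product(candidate_layers, linear_blocks):
        model = build_model(full_attention_layers=keep_full, linear_block=block)
        tps = measure_throughput(model)   # tokens/sec on the target hardware
        if tps < min_tok_per_s:
            continue                      # violates the hardware budget
        acc = eval_accuracy(model)
        if best is None or acc > best[0]:
            best = (acc, keep_full, block)
    return best
```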