elvis (@omarsar0) · Aug 2 · 9 tweets
Hierarchical Reasoning Model

This is one of the most interesting ideas on reasoning I've read in the past couple of months.

It uses a recurrent architecture for impressive hierarchical reasoning.

Here are my notes:
The paper proposes a novel, brain-inspired architecture that replaces CoT prompting with a recurrent model designed for deep, latent computation.
It moves away from token-level reasoning by using two coupled modules: a slow, high-level planner and a fast, low-level executor.

The two recurrent networks operate at different timescales to collaboratively solve tasks.

Leads to greater reasoning depth and efficiency with only 27M parameters and no pretraining!
Despite its small size and minimal training data (~1k examples), HRM solves complex tasks like ARC, Sudoku-Extreme, and 30×30 maze navigation, where CoT-based LLMs fail.
HRM introduces hierarchical convergence, where the low-level module rapidly converges within each cycle, and the high-level module updates only after this local equilibrium is reached.

This enables nested computation and avoids the premature convergence typical of standard RNNs.
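The two-timescale loop above can be sketched in a few lines of numpy. This is a minimal toy, not the paper's implementation: random fixed weights stand in for the trained recurrent cells, and the width, step count, and cycle count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16   # hidden width (illustrative)
T = 8    # fast low-level steps per cycle
N = 4    # slow high-level cycles

# Random fixed weights stand in for trained recurrent cells.
W_L = rng.normal(0, 0.3, (D, D))
W_H = rng.normal(0, 0.3, (D, D))
x = rng.normal(size=D)   # input embedding

z_H = np.zeros(D)        # slow, high-level (planner) state
for cycle in range(N):
    z_L = np.zeros(D)    # fast, low-level (executor) state, re-run each cycle
    for t in range(T):
        # the low-level module iterates toward a local equilibrium,
        # conditioned on the frozen high-level state and the input
        z_L = np.tanh(W_L @ z_L + z_H + x)
    # the high-level module updates only once per cycle,
    # from the converged low-level state
    z_H = np.tanh(W_H @ z_H + z_L)
```

The key property is the nesting: the inner state is reset and re-converged every cycle, so the outer module always sees a settled low-level computation rather than a half-finished one.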
A 1-step gradient approximation sidesteps memory-intensive backpropagation-through-time (BPTT).

This enables efficient training using only local gradient updates, grounded in deep equilibrium models.
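The idea can be shown with a hand-derived toy: run many recurrent steps without storing activations, then differentiate through the last step only, so memory stays O(1) in the number of steps. The cell, loss, and dimensions below are my own illustrative choices, not the paper's.

```python
import numpy as np

def cell(z, x, W):
    """One recurrent step: z_next = tanh(W z + x)."""
    return np.tanh(W @ z + x)

rng = np.random.default_rng(1)
D = 8
W = rng.normal(0, 0.3, (D, D))
x = rng.normal(size=D)

# Forward: many steps, with NO stored tape of activations (no BPTT).
z = np.zeros(D)
for _ in range(32):
    z_prev = z           # keep only the state entering the final step
    z = cell(z, x, W)

# Backward: differentiate through the LAST step only.
# For z = tanh(u) with u = W z_prev + x, dz/du = 1 - z**2.
g = z - np.ones(D)                 # e.g. gradient of 0.5 * ||z - 1||^2
du = g * (1.0 - z ** 2)            # back through the final tanh
grad_W = np.outer(du, z_prev)      # dL/dW from the final step alone
```

Unrolling all 32 steps with BPTT would store 32 activations; the 1-step approximation keeps just `z_prev`, which is what makes deep latent computation affordable.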
HRM implements adaptive computation time using a Q-learning-based halting mechanism, dynamically allocating compute based on task complexity.

This allows the model to “think fast or slow” and scale at inference time without retraining.
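A halting loop of this shape can be sketched as follows. The Q-head weights and segment dynamics here are random placeholders (in the paper they are learned via Q-learning); only the control flow, halt when the "halt" value beats "continue" or a budget is hit, reflects the mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M_MAX = 16, 8   # hidden width, max reasoning segments (illustrative)

W_q = rng.normal(0, 0.3, (2, D))   # stand-in Q-head: rows = [halt, continue]

def segment(z):
    """One full HRM forward segment (placeholder dynamics)."""
    return np.tanh(rng.normal(0, 0.3, (D, D)) @ z + rng.normal(size=D))

z = np.zeros(D)
for m in range(1, M_MAX + 1):
    z = segment(z)
    q_halt, q_continue = W_q @ z   # learned via Q-learning in the paper
    if m == M_MAX or q_halt > q_continue:
        break
steps_used = m   # easy inputs halt early; hard ones use more segments
```

Since the budget `M_MAX` is just a loop bound, it can be raised at inference time, which is the "scale without retraining" property.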
Experiments on ARC-AGI, Sudoku-Extreme, and Maze-Hard show that HRM significantly outperforms larger models using CoT or direct prediction, even solving problems that other models fail entirely (e.g., 74.5% on Maze-Hard vs. 0% for others).
Analysis reveals that HRM learns a dimensionality hierarchy similar to the cortex: the high-level module operates in a higher-dimensional space than the low-level one (participation ratio, PR: 89.95 vs. 30.22).

The authors suggest that this is an emergent trait not present in untrained models.

Paper: arxiv.org/abs/2506.21734


More from @omarsar0

Jul 30
Graph-R1

New RAG framework just dropped!

Combines agents, GraphRAG, and RL.

Here are my notes:
Introduces a novel RAG framework that moves beyond traditional one-shot or chunk-based retrieval by integrating graph-structured knowledge, agentic multi-turn interaction, and RL.
Graph-R1 is an agent that reasons over a knowledge hypergraph environment by iteratively issuing queries and retrieving subgraphs using a multi-step “think-retrieve-rethink-generate” loop.

Unlike prior GraphRAG systems that perform fixed retrieval, Graph-R1 dynamically explores the graph based on evolving agent state.
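The multi-turn loop can be sketched as below. `retrieve`, `llm`, and the prompt modes are hypothetical stand-ins for the hypergraph retriever and the policy model, not Graph-R1's actual API; the point is the shape of the loop, where retrieved subgraphs feed back into the agent's state before the next query.

```python
# Hypothetical sketch of the "think-retrieve-rethink-generate" loop.
# retrieve(query) and llm(mode, ...) are illustrative stand-ins.
def graph_r1(question, retrieve, llm, max_turns=4):
    context = []                                   # evolving agent state
    for _ in range(max_turns):
        thought = llm("think", question, context)  # reason over state so far
        if llm("sufficient?", question, context) == "yes":
            break                                  # rethink step: stop retrieving
        query = llm("write_query", thought)        # issue a new query
        subgraph = retrieve(query)                 # pull a hypergraph neighborhood
        context.append(subgraph)
    return llm("generate", question, context)      # final answer
```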
Jul 28
GLM-4.5 looks like a big deal!

> MoE Architecture
> Hybrid reasoning models
> 355B total (32B active)
> GQA + partial RoPE
> Multi-Token Prediction
> Muon Optimizer + QK-Norm
> 22T-token training corpus
> Slime RL Infrastructure
> Native tool use

Here's all you need to know:
Model Architecture & Pre-Training

GLM-4.5 has 355B total parameters (32B active); it is a deeper model with narrower width, optimized for reasoning via more layers and 96 attention heads.

GLM-4.5-Air is 106B (12B active).

22T-token training corpus that combines 15T general data with 7T code/reasoning-focused data.

Grouped-Query Attention + partial RoPE to enhance long-context efficiency and accuracy in reasoning tasks.
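GQA's efficiency win is that several query heads share one K/V head, shrinking the KV cache. A minimal numpy sketch, with made-up head counts and dimensions (the thread doesn't state GLM-4.5's KV-head count):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 4, 8                    # sequence length, per-head dim (illustrative)
n_q_heads, n_kv_heads = 8, 2   # 4 query heads share each K/V head
group = n_q_heads // n_kv_heads

Q = rng.normal(size=(n_q_heads, T, d))
K = rng.normal(size=(n_kv_heads, T, d))   # cached once per GROUP of query heads
V = rng.normal(size=(n_kv_heads, T, d))

out = np.empty_like(Q)
for h in range(n_q_heads):
    kv = h // group                        # query head h -> shared KV head
    scores = Q[h] @ K[kv].T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # row-wise softmax
    out[h] = attn @ V[kv]
```

Here the KV cache is 4x smaller than full multi-head attention (2 K/V heads instead of 8) while every query head still attends independently.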
Mid-training looks like a key part of this model

"Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data."
Jul 27
Claude Code is more than a coding agent.

It's more like a super smart orchestrator agent.

Watch this evaluator loop agent I just built using sub agents and / commands.

This is one of the fastest ways to build custom agentic workflows.

Claude Code is no joke!
I'm impressed to see how easy it is to control how the sub agents communicate with each other (e.g., chain, loop, hierarchical, critic).

Claude Code is good out of the box, but customization gives you a clear advantage.

Custom sub agents + / commands solve that.
It's worth spending the time optimizing instructions, tool use, agent definitions, and more.

Claude Code, on its own, somehow likes to use a lot of tokens and perform unnecessary tasks/tool calls.

You can max out credits or hit rate limits really fast if you are not careful.
Jul 19
Context Rot

Great title for a report, but even better insights about how increasing input tokens impact the performance of top LLMs.

Banger report from Chroma.

Here are my takeaways (relevant for AI devs):
Context Rot

The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.

Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, what they term "context rot."
Simple tasks reveal degradation

Even basic benchmarks like semantic variants of Needle-in-a-Haystack, repeated word copying, or long QA logs (LongMemEval) expose accuracy drops as context length increases.

The decline is more dramatic for semantically ambiguous inputs or outputs that scale with length.
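A harness for this kind of sweep might look like the following. The stub `model`, the filler text, and the mid-document needle placement are my assumptions for illustration, not Chroma's actual protocol; the report varies position and semantic similarity as well.

```python
# Hypothetical sweep: the same needle-in-a-haystack QA pairs, asked at
# growing context lengths, to measure accuracy as input length grows.
def accuracy_vs_length(model, needle_qas, filler, lengths):
    results = {}
    for n in lengths:
        correct = 0
        for needle, question, answer in needle_qas:
            haystack = (filler * (n // len(filler) + 1))[:n]  # pad to n chars
            mid = len(haystack) // 2
            prompt = haystack[:mid] + needle + haystack[mid:] + "\n" + question
            correct += model(prompt).strip() == answer
        results[n] = correct / len(needle_qas)
    return results
```

"Context rot" shows up when the resulting accuracy curve slopes downward with `n` even though the task itself never changed.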
Jul 18
A Survey of Context Engineering

160+ pages covering the most important research around context engineering for LLMs.

This is a must-read!

Here are my notes:
The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.
The context engineering evolution timeline from 2020 to 2025 runs from foundational RAG systems to complex multi-agent architectures.
Jul 17
Agent Leaderboard v2 is here!

> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains

More below:
@rungalileo introduces Agent Leaderboard v2, a domain-specific evaluation benchmark for AI agents designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investment.
Unlike earlier tool-calling benchmarks that saturate at 90%+ accuracy, v2 focuses on Action Completion (AC) and Tool Selection Quality (TSQ) in complex, multi-turn conversations.
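The thread doesn't give the metric formulas, so the plain-ratio versions below are my illustrative guesses at what AC and TSQ measure, not Galileo's exact scoring:

```python
# Illustrative (hypothetical) scoring for the two v2 metrics.
def tool_selection_quality(calls, expected):
    """TSQ sketch: fraction of turns where the agent picked the expected tool."""
    hits = sum(c == e for c, e in zip(calls, expected))
    return hits / len(expected)

def action_completion(goals_done, goals_total):
    """AC sketch: fraction of the user's goals the agent fully completed."""
    return goals_done / goals_total
```

The distinction matters: an agent can score high on TSQ (right tools) while still failing AC (task never actually finished), which is what separates v2 from saturated tool-calling benchmarks.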
