Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, PaperswithCode, PhD • I share insights on how to build with AI Agents ⬇️
28 subscribers
Aug 7 • 30 tweets • 9 min read
BREAKING: OpenAI introduces GPT-5
Here's everything you need to know:
Altman claims that with GPT-5, it is now like talking to an expert.
It can write entire programs from scratch. Software-on-demand is a defining characteristic.
PhD-level experts in your pockets.
Aug 3 • 15 tweets • 6 min read
The Agentic Web is upon us!
If you want to learn about the Agentic Web, look no further.
This new report is a banger!
It presents a detailed framework to understand and build the agentic web.
Here is everything you need to know:
Agentic Web
This paper introduces the concept of the Agentic Web, a transformative vision of the internet where autonomous AI agents, powered by LLMs, act on behalf of users to plan, coordinate, and execute tasks.
Aug 2 • 9 tweets • 3 min read
Hierarchical Reasoning Model
This is one of the most interesting ideas on reasoning I've read in the past couple of months.
It uses a recurrent architecture for impressive hierarchical reasoning.
Here are my notes:
The paper proposes a novel, brain-inspired architecture that replaces CoT prompting with a recurrent model designed for deep, latent computation.
Jul 30 • 7 tweets • 3 min read
Graph-R1
New RAG framework just dropped!
Combines agents, GraphRAG, and RL.
Here are my notes:
Introduces a novel RAG framework that moves beyond traditional one-shot or chunk-based retrieval by integrating graph-structured knowledge, agentic multi-turn interaction, and RL.
Jul 28 • 14 tweets • 5 min read
GLM-4.5 looks like a big deal!
> MoE Architecture
> Hybrid reasoning models
> 355B total (32B active)
> GQA + partial RoPE
> Multi-Token Prediction
> Muon Optimizer + QK-Norm
> 22T-token training corpus
> Slime RL Infrastructure
> Native tool use
Here's all you need to know:
Model Architecture & Pre-Training
GLM-4.5 is 355B total parameters (32B active); deeper model with narrower width; optimized for reasoning via more layers and 96 attention heads.
GLM-4.5-Air is 106B (12B active).
22T-token training corpus that combines 15T general data with 7T code/reasoning-focused data.
Grouped-Query Attention + partial RoPE to enhance long-context efficiency and accuracy in reasoning tasks.
Jul 27 • 6 tweets • 2 min read
Claude Code is more than a coding agent.
It's more like a super smart orchestrator agent.
Watch this evaluator loop agent I just built using sub agents and / commands.
This is one of the fastest ways to build custom agentic workflows.
Claude Code is no joke!
I'm impressed to see how easy it is to control how the sub agents communicate with each other (i.e., chain, loop, hierarchical, critic, etc.).
Claude Code is good out of the box, but customization gives you a clear advantage.
Custom sub agents + / commands solve that.
Jul 19 • 8 tweets • 3 min read
Context Rot
Great title for a report, but even better insights about how increasing input tokens impact the performance of top LLMs.
Banger report from Chroma.
Here are my takeaways (relevant for AI devs):
Context Rot
The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.
Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, what they term "context rot."
Jul 18 • 12 tweets • 4 min read
A Survey of Context Engineering
160+ pages covering the most important research around context engineering for LLMs.
This is a must-read!
Here are my notes:
The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.
Jul 17 • 7 tweets • 3 min read
Agent Leaderboard v2 is here!
> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains
More below:
@rungalileo introduces Agent Leaderboard v2, a domain-specific evaluation benchmark for AI agents designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investment.
Jul 14 • 6 tweets • 3 min read
One Token to Fool LLM-as-a-Judge
Watch out for this one, devs!
Semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards.
Here are my notes:
Overview
Investigates the surprising fragility of LLM-based reward models used in Reinforcement Learning with Verifiable Rewards (RLVR).
The authors find that inserting superficial, semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards, regardless of the actual correctness of the response.
Jul 10 • 21 tweets • 6 min read
BREAKING: xAI announces Grok 4
"It can reason at a superhuman level!"
Here is everything you need to know:
Elon claims that Grok 4 is smarter than almost all grad students in all disciplines simultaneously.
100x more training than Grok 2.
10x more compute on RL than any of the models out there.
Jul 8 • 6 tweets • 3 min read
MemAgent
MemAgent-14B is trained on 32K-length documents with an 8K context window.
Achieves >76% accuracy even at 3.5M tokens!
That consistency is crazy!
Here are my notes:
Overview
Introduces an RL–driven memory agent that enables transformer-based LLMs to handle documents up to 3.5 million tokens with near lossless performance, linear complexity, and no architectural modifications.
Jul 6 • 5 tweets • 2 min read
Agentic RAG for Personalized Recommendation
This is a really good example of integrating agentic reasoning into RAG.
Leads to better personalization and improved recommendations.
Here are my notes:
Overview
This work introduces a multi-agent framework, ARAG, that enhances traditional RAG systems with reasoning agents tailored to user modeling and contextual ranking.
It reframes recommendation as a structured coordination problem between LLM agents.
Jul 3 • 11 tweets • 4 min read
AI for Scientific Search
AI for Science is where I spend most of my time exploring with AI agents.
This 120+ pages report does a good job of highlighting why all the big names like OpenAI and Google DeepMind are pursuing AI4Science.
Bookmark it!
My notes below:
There are five key areas:
(1) AI for Scientific Comprehension (2) AI for Academic Survey (3) AI for Scientific Discovery (4) AI for Academic Writing (5) AI for Academic Peer Review
Jul 1 • 8 tweets • 3 min read
Small Language Models are the Future of Agentic AI
Lots to gain from building agentic systems with small language models.
Capabilities are increasing rapidly!
AI devs should be exploring SLMs.
Here are my notes:
Overview
This position paper argues that small language models (SLMs), defined pragmatically as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented.
Jun 24 • 7 tweets • 3 min read
Ultra-Fast LLMs Based on Diffusion
> throughputs of 1109 tokens/sec and 737 tokens/sec
> outperforms speed-optimized frontier models by up to 10× on average
Diffusion LLMs are early, but could be huge.
More in my notes below:
✦ Overview
This paper introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference.
Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process.
Jun 23 • 9 tweets • 3 min read
This paper is impressive!
It introduces a clever way of keeping memory use constant regardless of task length.
Great use of RL for AI agents to efficiently use memory and reasoning.
Here are my full notes:
Overview
The paper presents an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state.
Jun 23 • 8 tweets • 3 min read
Towards AI Search Paradigm
Very detailed report on building scalable multi-agent AI search systems.
Multi-agent, DAG, MCPs, RL, and much more.
If you are a dev integrating search into your AI agents, look no further:
Quick Overview
The paper proposes a modular multi-agent system that reimagines how AI handles complex search tasks, aiming to emulate human-like reasoning and information synthesis.
Jun 22 • 13 tweets • 5 min read
Another insane report from Anthropic.
They find that LLM agents engage in blackmail at high rates when threatened with replacement.
Faced with replacement threats, the models would use statements like “Self-preservation is critical.”
This is wild!
More findings below:
Quick Overview
The study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction.
Jun 20 • 13 tweets • 4 min read
Future of Work with AI Agents
Stanford's new report analyzes what 1500 workers think about working with AI Agents.
What types of AI Agents should we build?
A few surprises!
Let's take a closer look:
Quick Overview
The audit proposes a large-scale framework for understanding where AI agents should automate or augment human labor.
The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale to quantify desired human involvement in AI-agent-supported work.
Jun 19 • 7 tweets • 3 min read
Leaky Thoughts
Hey AI devs, be careful how you prompt reasoning models.
This work shows that reasoning traces frequently contain sensitive user data.
More of my notes below:
The work investigates the privacy risks introduced by reasoning traces (RTs) in Large Reasoning Models (LRMs) when used as personal agents.
It shows that, unlike outputs, RTs often leak sensitive data such as names, health info, and identifiers, posing a novel attack surface.