elvis
Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, PaperswithCode, PhD • I share insights on how to build with LLMs & AI Agents ⬇️
Jun 24 7 tweets 3 min read
Ultra-Fast LLMs Based on Diffusion

> throughputs of 1109 tokens/sec and 737 tokens/sec
> up to 10× faster on average than speed-optimized frontier models

Diffusion LLMs are early, but could be huge.

More in my notes below:

Overview

This paper introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference.

Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process.
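
To make that concrete, here is a toy sketch of coarse-to-fine parallel decoding in Python. The `denoise_step` stand-in and the commit schedule are my own illustration, not Mercury's actual algorithm.

```python
import random

MASK = "<mask>"

def denoise_step(tokens):
    # A real dLLM proposes a token + confidence for every masked slot at once;
    # this random stand-in just mimics that interface.
    return {i: (f"tok{i}", random.random()) for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(seq_len=8, num_steps=4):
    tokens = [MASK] * seq_len
    for _ in range(num_steps):
        proposals = denoise_step(tokens)
        # Commit the most confident proposals first; low-confidence positions
        # stay masked and get refined in later passes (coarse-to-fine).
        keep = sorted(proposals.items(), key=lambda kv: -kv[1][1])[: seq_len // num_steps]
        for i, (tok, _) in keep:
            tokens[i] = tok
    return tokens  # a few parallel passes instead of seq_len sequential steps

print(diffusion_decode())
```

The speedup comes from that last comment: throughput scales with the number of refinement passes, not the sequence length.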
Jun 23 9 tweets 3 min read
This paper is impressive!

It introduces a clever way of keeping memory use constant regardless of task length.

A great use of RL to teach AI agents to use memory and reasoning efficiently.

Here are my full notes:

Overview

The paper presents an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state.
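
A minimal sketch of that idea, assuming a hypothetical `llm` callable and environment: instead of appending every observation to an ever-growing context, the agent rewrites a fixed-size state each turn.

```python
MAX_STATE_CHARS = 2000  # hard cap keeps memory constant regardless of task length

def consolidate(llm, state, observation):
    prompt = (
        "Current memory/reasoning state:\n" + state +
        "\n\nNew observation:\n" + observation +
        "\n\nRewrite the state: keep only what is needed to finish the task, "
        f"in under {MAX_STATE_CHARS} characters."
    )
    return llm(prompt)[:MAX_STATE_CHARS]

def run_agent(llm, env, max_turns=50):
    state = "No progress yet."
    for _ in range(max_turns):
        action = llm("State:\n" + state + "\n\nChoose the next action.")
        observation, done = env.step(action)
        state = consolidate(llm, state, observation)  # context size never grows
        if done:
            break
    return state
```

The RL part (which this sketch omits) is training the model so that `consolidate` learns what to keep and what to drop.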
Jun 23 8 tweets 3 min read
Towards AI Search Paradigm

Very detailed report on building scalable multi-agent AI search systems.

Multi-agent, DAG, MCPs, RL, and much more.

If you are a dev integrating search into your AI agents, look no further:

Quick Overview

The paper proposes a modular multi-agent system that reimagines how AI handles complex search tasks, aiming to emulate human-like reasoning and information synthesis.
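
A quick sketch of the DAG part using Python's stdlib `graphlib`: sub-tasks run in dependency order, each agent seeing only its upstream results. Node names and the `run_agent` helper are my own illustration, not the paper's API.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

dag = {
    "plan":       set(),
    "web_search": {"plan"},
    "tool_call":  {"plan"},
    "synthesize": {"web_search", "tool_call"},
}

def run_agent(node, upstream_results):
    # Stand-in for a real agent call; receives only its dependencies' outputs.
    return f"<result of {node} given {sorted(upstream_results)}>"

results = {}
for node in TopologicalSorter(dag).static_order():
    results[node] = run_agent(node, {d: results[d] for d in dag[node]})
print(results["synthesize"])
```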
Jun 22 13 tweets 5 min read
Another insane report from Anthropic.

They find that LLM agents engage in blackmail at high rates when threatened with replacement.

Faced with replacement threats, the models would use statements like “Self-preservation is critical.”

This is wild!

More findings below:

Quick Overview

The study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction.
Jun 20 13 tweets 4 min read
Future of Work with AI Agents

Stanford's new report analyzes what 1500 workers think about working with AI Agents.

What types of AI Agents should we build?

A few surprises!

Let's take a closer look:

Quick Overview

The audit proposes a large-scale framework for understanding where AI agents should automate or augment human labor.

The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale to quantify desired human involvement in AI-agent-supported work.
Jun 19 7 tweets 3 min read
Leaky Thoughts

Hey AI devs, be careful how you prompt reasoning models.

This work shows that reasoning traces frequently contain sensitive user data.

More of my notes below:

The work investigates the privacy risks introduced by reasoning traces (RTs) in Large Reasoning Models (LRMs) when used as personal agents.

It shows that, unlike outputs, RTs often leak sensitive data such as names, health info, and identifiers, posing a novel attack surface.
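
A minimal defensive sketch motivated by the finding: scan reasoning traces for obvious identifiers before logging or exposing them. Real PII detection needs far more than regexes; this only illustrates the attack surface.

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def leaked_pii(reasoning_trace: str) -> dict:
    """Return any sensitive-looking strings found in a reasoning trace."""
    return {kind: pat.findall(reasoning_trace)
            for kind, pat in PII_PATTERNS.items()
            if pat.search(reasoning_trace)}

trace = "The user, jane.doe@example.com, asked about her test results..."
print(leaked_pii(trace))  # {'email': ['jane.doe@example.com']}
```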
Jun 18 7 tweets 3 min read
From Bytes to Ideas

Avoids predefined vocabularies and memory-heavy embedding tables.

Instead, it uses an autoregressive U-Net to embed information directly from raw bytes.

This is huge! It enables an effectively unlimited vocabulary and more.

More in my notes below:

Quick Overview

It proposes AU-Net, a hierarchical byte-level language model that internalizes tokenization by learning to embed text from raw bytes through a multiscale, autoregressive U-Net architecture.
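
A heavily simplified toy of the byte-level idea: embed the 256 possible byte values and pool them into word-level vectors, so no string is ever out-of-vocabulary. The real model learns this hierarchy with an autoregressive U-Net; this shows only one pooling scale with random embeddings.

```python
import random

random.seed(0)
BYTE_EMB = [[random.gauss(0, 1) for _ in range(4)] for _ in range(256)]  # 256 x 4

def embed_text(text: str):
    """Embed raw bytes, then mean-pool at whitespace boundaries (one scale)."""
    word_vecs = []
    for word in text.encode("utf-8").split():
        vecs = [BYTE_EMB[b] for b in word]
        word_vecs.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return word_vecs  # works for any string: no out-of-vocab tokens, ever

print(len(embed_text("arbitrary ∞ vocabulary")))  # 3 pooled vectors
```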
Jun 17 8 tweets 4 min read
Providing “cognitive tools” to GPT-4.1 increases performance on AIME2024 from 26.7% to 43.3%.

Damn!

That's very close to the performance of o1-preview.

Reasoning as a tool goes hard!

Here are my notes:

Quick Overview

Proposes a modular, tool-based approach to eliciting reasoning in LLMs, inspired by cognitive science.

Rather than relying solely on RL or chain-of-thought (CoT) prompting, the authors introduce a framework where the LLM calls self-contained "cognitive tools" to modularize and scaffold internal reasoning.
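
A sketch of the pattern: each cognitive tool is itself a self-contained LLM call, and the main model orchestrates them. The tool names follow the paper's four tools as I read them; the prompts and the `llm` callable are my own.

```python
COGNITIVE_TOOLS = {
    "understand_question": "Restate the problem, list knowns and unknowns:\n{input}",
    "recall_related":      "Recall similar solved problems and their key ideas:\n{input}",
    "examine_answer":      "Check this candidate answer for errors:\n{input}",
    "backtracking":        "The current approach failed; propose another:\n{input}",
}

def call_tool(llm, name: str, payload: str) -> str:
    # Each tool runs in its own context, keeping reasoning modular.
    return llm(COGNITIVE_TOOLS[name].format(input=payload))

def solve(llm, problem: str) -> str:
    notes = call_tool(llm, "understand_question", problem)
    # The orchestrating model may call more tools before committing to an answer.
    return llm(f"Problem: {problem}\nNotes: {notes}\nGive the final answer.")
```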
Jun 16 8 tweets 3 min read
Enhancing RAG with Application-Aware Reasoning

Neat trick to improve RAG systems: give the model the relevant knowledge and show it how to apply it.

Very simple and effective!

This approach also works well with AI agents.

Pay attention, AI devs.

Here are my notes:

Quick Overview

It introduces RAG+, a modular framework that improves traditional RAG systems by explicitly incorporating application-level reasoning into the retrieval and generation pipeline.

It bridges retrieval and generation with an application-aware stage.
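
A minimal sketch of the prompt-level idea: retrieve both the knowledge and a worked application example, then condition generation on the pair. The two retriever helpers are placeholders for any retriever, not RAG+'s actual components.

```python
def retrieve_knowledge(query: str) -> str:
    return "<relevant fact or formula>"        # stand-in retriever

def retrieve_application(query: str) -> str:
    return "<worked example applying that fact>"  # stand-in retriever

def rag_plus_prompt(query: str) -> str:
    return (
        f"Question: {query}\n\n"
        f"Relevant knowledge:\n{retrieve_knowledge(query)}\n\n"
        f"How this knowledge is applied in practice:\n{retrieve_application(query)}\n\n"
        "Using the knowledge and the application example, answer the question."
    )
```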
Jun 14 8 tweets 3 min read
Anthropic is killing it with these technical posts.

If you're an AI dev, stop what you are doing and go read this.

It shows, in great detail, how to implement an effective multi-agent research system.

Pay attention to these key parts:

Anthropic shares how they built Claude's new multi-agent Research feature, an architecture where a lead Claude agent spawns and coordinates subagents to explore complex queries in parallel.

They use the orchestrator-worker architecture.
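
Here is the shape of that pattern in asyncio. The decomposition format and the `llm` stub are my assumptions, not Anthropic's implementation.

```python
import asyncio

async def llm(prompt: str) -> str:
    await asyncio.sleep(0)            # stand-in for a real async API call
    return f"<response to: {prompt[:40]}...>"

async def research(query: str) -> str:
    # Orchestrator: decompose the query into independent subtasks.
    plan = await llm(f"Split into independent research subtasks, one per line: {query}")
    subtasks = [line for line in plan.splitlines() if line.strip()]
    # Workers: explore subtasks concurrently, each with its own context window.
    findings = await asyncio.gather(*(llm(f"Research: {t}") for t in subtasks))
    # Orchestrator again: synthesize the workers' findings.
    return await llm("Synthesize these findings:\n" + "\n".join(findings))

print(asyncio.run(research("How do diffusion LLMs achieve high throughput?")))
```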
Jun 13 7 tweets 2 min read
Deep Research Agent for Large Systems Code

Nice paper from Microsoft!

Builds a deep research agent for large systems codebases.

Lots of interesting tricks for handling very large codebases on this one.

Here are my notes:

Quick Overview

This work introduces Code Researcher, a deep research agent designed for debugging large-scale systems code.

The agent performs multi-step reasoning over crash reports, system semantics, and commit histories to synthesize crash-resolving patches.
Jun 13 6 tweets 3 min read
TableRAG

A new RAG framework for heterogeneous document reasoning.

My notes below:

TableRAG tackles a core limitation of existing RAG approaches: their inability to reason effectively over heterogeneous documents that combine both unstructured text and structured tables.
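
A hedged sketch of the routing idea: send each query either to structured execution over tables or to text retrieval, then fuse the evidence. The router prompt, `run_sql`, and `TextIndex` are illustrative stand-ins, not TableRAG's actual components.

```python
def run_sql(tables, sql):                # hypothetical SQL executor over the tables
    return "<rows>"

class TextIndex:                         # hypothetical passage index
    def search(self, query, k=3):
        return ["<passage>"] * k

def answer(llm, query, text_index, tables):
    route = llm(f"Answer TABLE if this needs computation over a table, else TEXT: {query}")
    if "TABLE" in route:
        sql = llm(f"Write SQL over schemas {list(tables)} for: {query}")
        evidence = f"SQL result: {run_sql(tables, sql)}"
    else:
        evidence = "\n".join(text_index.search(query))
    return llm(f"Question: {query}\nEvidence:\n{evidence}\nAnswer:")
```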
Jun 11 9 tweets 4 min read
NEW: Meta releases V-JEPA 2, their new world model!

Foundation world models aim to accelerate physical AI, the next frontier.

Why is this a big deal?

Let's break it down:

Quick Overview

V-JEPA 2, a scalable joint-embedding predictive architecture for self-supervised video learning, targeting the goal of building a generalist world model capable of understanding, predicting, and planning in the physical world...
Jun 11 7 tweets 3 min read
On building your personalized deep research agents.

I recently built this deep research agentic workflow with n8n and was very impressed by the results.

Combining reasoning models + multi-agent workflows is like magic!

A few things I learned along the way:

Most AI systems, like Deep Research and Manus AI, are just not great at generating personalized, polished outputs.

The outputs are usually generic in format and need polishing and verification afterward. Good luck trying to modify or improve those outputs via prompts.

It's not a jab at any of those tools. I still use a few of them for different things. But if you are frustrated with the results, learn to build the agentic system yourself. I promise you, it is not that hard.
Jun 10 8 tweets 3 min read
Reinforcement Pre-Training

New pre-training paradigm for LLMs just landed on arXiv!

It incentivizes effective next-token reasoning with RL.

This unlocks richer reasoning capabilities using only raw text and intrinsic RL signals.

A must-read! Bookmark it!

Here are my notes:

Paper Overview

This paper introduces Reinforcement Pre-Training (RPT), a novel paradigm that bridges LLM pretraining and RL by reinterpreting next-token prediction as a reasoning task rewarded via verifiable correctness...
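
A toy rendering of the reward: the model reasons, predicts the next token, and gets a verifiable 0/1 reward against the ground-truth token from raw text. The `llm` callable and the answer-extraction rule are placeholders, not the paper's exact setup.

```python
def rpt_reward(llm, prefix: str, true_next_token: str) -> float:
    # The model "thinks", then commits to a next-token prediction.
    out = llm(f"Think step by step, then predict the next token of:\n{prefix}")
    predicted = out.strip().split()[-1]        # assume the answer comes last
    # Verifiable correctness: the corpus itself supplies the ground truth.
    return 1.0 if predicted == true_next_token else 0.0

# Every position in ordinary text becomes a verifiable RL example:
# reward = rpt_reward(llm, "The capital of France is", "Paris")
```

That last comment is the key point: no human labels or reward models needed, just raw text.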
Jun 7 8 tweets 4 min read
The Illusion of Thinking in LLMs

Apple researchers discuss the strengths and limitations of reasoning models.

Apparently, reasoning models "collapse" beyond certain task complexities.

Lots of important insights on this one. (bookmark it!)

Here are my notes:

Paper Overview

Investigates the capabilities and limitations of frontier Large Reasoning Models (LRMs) like Claude 3.7, DeepSeek-R1, and OpenAI’s o-series by systematically analyzing their performance on reasoning tasks as a function of problem complexity.
Jun 5 7 tweets 3 min read
Knowledge or Reasoning?

Evaluation matters, and even more so when using reasoning LLMs.

Look at final response accuracy, but also pay attention to thinking trajectories.

Lots of good findings on this one.

Here are my notes:

Summary

Introduces a fine-grained evaluation framework to dissect LLM thinking into two components: knowledge correctness and reasoning informativeness, measured via Knowledge Index (KI) and Information Gain (InfoGain), respectively.
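
A toy rendering of the two metrics under my own simplifying assumptions (the paper's exact definitions differ): InfoGain as the step-wise change in the model's probability of the correct answer, KI as the fraction of factually correct steps.

```python
def info_gain(p_correct_per_step):
    """Step-wise change in P(correct answer) as the reasoning trace unfolds."""
    return [b - a for a, b in zip(p_correct_per_step, p_correct_per_step[1:])]

def knowledge_index(step_is_correct):
    """Fraction of reasoning steps whose factual content checks out."""
    return sum(step_is_correct) / len(step_is_correct)

print(info_gain([0.1, 0.4, 0.35, 0.9]))            # step 3 has negative gain
print(knowledge_index([True, True, False, True]))   # 0.75
```

The point of separating the two: a trace can be knowledge-correct yet uninformative, or informative yet factually shaky, and final-answer accuracy hides the difference.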
Jun 2 7 tweets 3 min read
Reasoning Models Thinking Slow and Fast at Test Time

Another super cool work on improving reasoning efficiency in LLMs.

They show that slow-then-fast reasoning outperforms other strategies.

Here are my notes:

What's the high level?

Introduces a universal framework, AlphaOne (α1), for modulating the reasoning progress of large reasoning models (LRMs) during inference.

Rather than relying on rigid or automatic schedules, α1 explicitly controls when and how models engage in “slow thinking” using a tunable parameter α.
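
A sketch of how a single α could modulate thinking, under my assumptions about the mechanism: α scales the slow-thinking budget, early attempts to stop thinking are sometimes replaced with a "wait"-style continuation, and the end of the budget forces fast answering. Token strings and the sampling scheme are simplified stand-ins.

```python
import random

def generate_with_alpha(model_step, avg_think_budget=512, alpha=1.4, p_wait=0.3):
    budget = int(alpha * avg_think_budget)      # alpha > 1 means think longer
    tokens = []
    for _ in range(budget):
        tok = model_step(tokens)
        # Slow phase: occasionally override the stop signal to keep deliberating.
        if tok == "</think>" and random.random() < p_wait:
            tok = "wait"
        tokens.append(tok)
        if tok == "</think>":
            break
    else:
        tokens.append("</think>")               # budget exhausted: force the answer
    return tokens
```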
May 31 10 tweets 4 min read
Open-Ended Evolution of Self-Improving Agents

Can AI systems endlessly improve themselves?

This work shows the potential of self-improving AI, inspired by biological evolution and open-ended exploration.

This is a must-read!

Here are my notes:

What's the high level?

This work presents the Darwin Gödel Machine (DGM), a system that advances the vision of self-improving AI by combining self-referential code modification with open-ended evolutionary search...
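
A compact sketch of the loop as described: keep an archive of agents, branch from any of them, let the child rewrite its own code, and archive every valid result, not just improvements; that is the open-ended part. The `self_modify` and `benchmark_score` callables are placeholders for the real machinery.

```python
import random

def darwin_godel_machine(seed_agent, self_modify, benchmark_score, generations=100):
    archive = [(seed_agent, benchmark_score(seed_agent))]
    for _ in range(generations):
        parent, _ = random.choice(archive)      # open-ended: any ancestor can branch
        child = self_modify(parent)             # agent rewrites its own code
        score = benchmark_score(child)
        if score is not None:                   # every valid agent joins the archive,
            archive.append((child, score))      # even if it is not the current best
    return max(archive, key=lambda kv: kv[1])
```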
May 30 5 tweets 2 min read
Building Production-Grade Conversational Agents with Workflow Graphs

Uses DAGs to design robust, complex agentic systems.

If you're building AI agents, this is worth a read.

Here are my notes:

Quick overview

This paper presents a pragmatic, production-ready framework for building LLM-powered conversational agents using workflow graphs, with a specific focus on e-commerce scenarios.

Instead of relying solely on end-to-end generation, the authors design agents using a directed acyclic graph (DAG), enabling flexible yet controllable interactions that adhere to strict business rules and format constraints.
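
A minimal sketch of the workflow-graph idea: each node carries its own prompt and local rules, and the conversation can only move along the graph's edges. Node names and the routing rule are my own e-commerce-flavored illustration, not the paper's graph.

```python
WORKFLOW = {
    "greet":    {"prompt": "Greet the user and ask their intent.",        "next": ["classify"]},
    "classify": {"prompt": "Classify the request: ORDER or REFUND.",      "next": ["order", "refund"]},
    "order":    {"prompt": "Handle the order. Never promise discounts.",  "next": []},
    "refund":   {"prompt": "Handle the refund per policy X.",             "next": []},
}

def step(llm, node: str, user_msg: str) -> tuple[str, str]:
    # Each node enforces its local business rules via its own prompt.
    reply = llm(WORKFLOW[node]["prompt"] + "\nUser: " + user_msg)
    nxt = WORKFLOW[node]["next"]
    if len(nxt) > 1:                       # route on the node's own output
        node = "refund" if "REFUND" in reply else "order"
    elif nxt:
        node = nxt[0]
    return reply, node
```

A driver loop would call `step` repeatedly, carrying the node forward between turns, so the agent can never wander outside the sanctioned flow.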
May 29 8 tweets 3 min read
An Operating System for Memory-Augmented Generation in LLMs

Lots of great ideas on how to think about memory and better manage it in LLM-based agents.

Must read!

Here are my notes:

It introduces a unified operating system for managing memory in LLMs, addressing a key limitation in current architectures: their lack of structured, persistent, and governable memory...
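
To make "OS-like" concrete, a hedged sketch of a memory layer with explicit write/read/evict calls. Field names and the eviction policy are my assumptions, not the paper's design.

```python
import time

class MemoryOS:
    def __init__(self, capacity=1000):
        self.capacity, self.store = capacity, {}   # key -> (value, metadata)

    def write(self, key, value, tags=()):
        self.store[key] = (value, {"tags": set(tags), "ts": time.time(), "hits": 0})

    def read(self, tag):
        hits = []
        for key, (value, meta) in self.store.items():
            if tag in meta["tags"]:
                meta["hits"] += 1                  # usage informs eviction
                hits.append((key, value))
        return hits

    def evict(self):                               # governable: the policy is explicit
        while len(self.store) > self.capacity:
            lru = min(self.store, key=lambda k: (self.store[k][1]["hits"],
                                                 self.store[k][1]["ts"]))
            del self.store[lru]

mem = MemoryOS()
mem.write("user_pref_1", "prefers concise answers", tags=("preferences",))
print(mem.read("preferences"))
```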