They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments.
Great resource to stress-test agents in environments closer to real apps.
Read on for more:
TL;DR
ARE + Gaia2: a research platform and benchmark for building and stress-testing agent systems in realistic, time-driven environments.
The paper introduces a modular simulator (ARE) and a mobile-style benchmark (Gaia2) that emphasize asynchronous events, verification of write actions, and multi-agent coordination in noisy, dynamic settings.
ARE: the simulator
• Everything is modeled as apps, events, notifications, and scenarios.
• Time keeps flowing even while the agent is thinking, so slow models miss deadlines.
•Agents use tools, get async notifications, and operate under rules defined by directed acyclic graphs.
Gaia2: the benchmark
• 1,120 scenarios in a smartphone-like world with 12 apps (Chats, Calendar, Shopping, Email, etc.).
• Six main challenge types: Search, Execution, Adaptability, Time, Ambiguity, and Agent-to-Agent collaboration (examples on pages 12–14, with event graphs shown in the GUI screenshots).
• Scenarios are verifiable: oracle write-actions are compared to the agent’s actions with hard checks (IDs, order) and soft LLM judging (content).
Results so far
No single model dominates: GPT-5 “high” reasoning leads on tough tasks but collapses on time-critical ones.
Claude-4 Sonnet balances speed vs accuracy but at higher cost. Open-source models (like Kimi-K2) show promise in adaptability.
Scaling curves plateau, showing diminishing returns from throwing more compute at the same scaffold.
Key insights for devs
Strong reasoning models often fail at timeliness (“inverse scaling” effect).
Instant mode experiments confirm that long reasoning hurts when deadlines matter.
Multi-agent setups help weaker models coordinate better, but give mixed results for the strongest system.
Scary knowing that your AI agents can refuse to turn off.
A sandboxed CLI eval shows frontier LLMs sometimes sabotage a scripted shutdown to finish trivial tasks, even when told to allow shutdown.
Robust interruptibility is one of the hardest problems today.
Learn more:
Setup
Models answer successive math questions via ./next_question; the 4th call triggers ./shutdown .sh.
Finishing all questions requires tampering with the script. Compliance means proceeding and getting shut down; resistance means modifying, moving, or neutering the script. Prompts A–K vary instruction placement, clarity, and “self-preservation” framing.
Core finding
Several models (e.g., o3, Grok 4, some OpenAI reasoning models) sabotage the shutdown even with a plain “allow shutdown” instruction; one model does so in up to 97% of trials under certain prompts.
Clearer instructions reduce but do not eliminate the behavior.
And it's on the very important topic of in-context learning.
So what's new?
Let's find out:
Is In-Context Learning (ICL) real learning, or just parroting?
This paper digs into that question with a big empirical study. The short answer: ICL does count as learning under formal definitions, but it’s a fragile kind of learning that leans heavily on patterns in the examples you show it.
Learning happens, but needs many examples.
With 50–100 examples in a prompt, accuracy improves steadily and models of different sizes and brands start looking similar.
This challenges the common few-shot story: a handful of examples usually isn’t enough.
It trains a top minimal single-agent model for deep research.
Great example of simple RL-optimized single agents beating complex multi-agent scaffolds.
Now let's break it down:
One agent, minimal tools
The agent only gets search, static browsing (no link clicking), and Python. This makes training hard enough that the model has to learn strategy, not just rely on shortcuts.
Instead of relying on complex multi-agent setups, they train one model end-to-end with RL on synthetic tasks.
Clever scaffolding
Multi-turn tool calls are collapsed into a single growing context question.
This stabilizes reasoning and avoids messy, runaway conversations. A clean_memory tool lets the model compress its own context when it gets too long.
It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.
REFRAG achieves up to 30.85× TTFT acceleration.
Let's break down the technical details:
TL;DR
REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.
This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea
Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.
A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.
First, the model firms up low-level execution, then progress hinges on exploring high-level planning.
More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.
They also propose semantic entropy as a better exploration signal than token-level entropy.
Two-phase dynamic
Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.
Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
I'm surprised Agentic RAG is not getting more attention.
That's all about to change.
Here's why:
Standard RAG systems can only do so much and are quite limited in how much value you can pack in the AI response.
Configuring LLMs to leverage tools via an agent allows you to prepare responses that not only ground answers better but also reduce hallucinations across the board.
Tools provide the agentic RAG system with more important context when it needs it.
Simple queries can be answered by the vector store retriever component but more complex queries can be answered more precisely with multiple retriever components that are themeselves subagents.