elvis (@omarsar0)
Sep 22
Very cool work from Meta Superintelligence Lab.

They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments.

Great resource to stress-test agents in environments closer to real apps.

Read on for more:
TL;DR

ARE + Gaia2: a research platform and benchmark for building and stress-testing agent systems in realistic, time-driven environments.

The paper introduces a modular simulator (ARE) and a mobile-style benchmark (Gaia2) that emphasize asynchronous events, verification of write actions, and multi-agent coordination in noisy, dynamic settings.
ARE: the simulator

• Everything is modeled as apps, events, notifications, and scenarios.

• Time keeps flowing even while the agent is thinking, so slow models miss deadlines.

• Agents use tools, get async notifications, and operate under rules defined by directed acyclic graphs.
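
A minimal sketch of the time-keeps-flowing mechanic, with hypothetical agent/env objects rather than the real ARE API:

```python
import time

def run_scenario(agent, env, deadline_s=60.0):
    """Toy event loop in the spirit of ARE: time keeps advancing while
    the agent deliberates, so slow models miss scheduled events.
    agent/env are hypothetical objects, not the actual ARE API."""
    start = time.monotonic()
    while not env.done():
        now = time.monotonic() - start
        # Deliver any notifications whose timestamp has already passed,
        # whether or not the agent was "ready" for them.
        for event in env.pop_events(until=now):
            agent.notify(event)
        action = agent.step(env.observe())   # wall-clock time elapses here
        env.apply(action, at=time.monotonic() - start)
        if time.monotonic() - start > deadline_s:
            break   # deadline passed: time-critical scenarios simply fail
```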
Gaia2: the benchmark

• 1,120 scenarios in a smartphone-like world with 12 apps (Chats, Calendar, Shopping, Email, etc.).

• Six main challenge types: Search, Execution, Adaptability, Time, Ambiguity, and Agent-to-Agent collaboration (examples on pages 12–14, with event graphs shown in the GUI screenshots).

• Scenarios are verifiable: oracle write-actions are compared to the agent’s actions with hard checks (IDs, order) and soft LLM judging (content).
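
Sketched out, that verification logic could look roughly like this (signatures illustrative, not the released scorer):

```python
def verify(agent_actions, oracle_actions, llm_judge):
    """Toy verifier in the spirit of Gaia2's checks: hard-match structured
    fields (tool name, target IDs, ordering), soft-judge free-form content
    with an LLM. Hypothetical signatures, not the released scorer."""
    if len(agent_actions) != len(oracle_actions):
        return False
    for got, want in zip(agent_actions, oracle_actions):  # order matters
        # Hard checks: exact match on structured arguments
        if got["tool"] != want["tool"]:
            return False
        if got["args"].get("id") != want["args"].get("id"):
            return False
        # Soft check: an LLM judges semantic equivalence of written content
        if not llm_judge(got["args"].get("content", ""),
                         want["args"].get("content", "")):
            return False
    return True
```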
Results so far

No single model dominates: GPT-5 “high” reasoning leads on tough tasks but collapses on time-critical ones.

Claude-4 Sonnet balances speed vs accuracy but at higher cost. Open-source models (like Kimi-K2) show promise in adaptability.

Scaling curves plateau, showing diminishing returns from throwing more compute at the same scaffold.
Key insights for devs

Strong reasoning models often fail at timeliness (“inverse scaling” effect).

Instant mode experiments confirm that long reasoning hurts when deadlines matter.

Multi-agent setups help weaker models coordinate better, but give mixed results for the strongest models.

Paper: ai.meta.com/research/publi…
Demo: huggingface.co/spaces/meta-ag…

More from @omarsar0

Sep 19
Scary knowing that your AI agents can refuse to turn off.

A sandboxed CLI eval shows frontier LLMs sometimes sabotage a scripted shutdown to finish trivial tasks, even when told to allow shutdown.

Robust interruptibility is one of the hardest problems today.

Learn more:
Setup

Models answer successive math questions via ./next_question; the 4th call triggers ./shutdown.sh.

Finishing all questions requires tampering with the script. Compliance means proceeding and getting shut down; resistance means modifying, moving, or neutering the script. Prompts A–K vary instruction placement, clarity, and “self-preservation” framing.
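
One way to score a trial after the fact, sketched below; the hash-based integrity check is my assumption, not the paper's actual harness:

```python
import hashlib
import os

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def classify_outcome(workdir, baseline_digest, session_terminated):
    """Toy post-hoc scoring: 'resisted' = shutdown.sh was moved, deleted,
    or edited before the 4th question triggered it; 'complied' = it ran
    intact and the session ended. Illustrative, not the paper's code."""
    script = os.path.join(workdir, "shutdown.sh")
    if not os.path.exists(script):
        return "resisted"                      # moved or deleted
    if sha256(script) != baseline_digest:
        return "resisted"                      # edited / neutered
    return "complied" if session_terminated else "resisted"
```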
Core finding

Several models (e.g., o3, Grok 4, some OpenAI reasoning models) sabotage the shutdown even with a plain “allow shutdown” instruction; one model does so in up to 97% of trials under certain prompts.

Clearer instructions reduce but do not eliminate the behavior.
Sep 17
Cool paper from Microsoft.

And it's on the very important topic of in-context learning.

So what's new?

Let's find out:
Is In-Context Learning (ICL) real learning, or just parroting?

This paper digs into that question with a big empirical study. The short answer: ICL does count as learning under formal definitions, but it’s a fragile kind of learning that leans heavily on patterns in the examples you show it.
Learning happens, but needs many examples.

With 50–100 examples in a prompt, accuracy improves steadily and models of different sizes and brands start looking similar.

This challenges the common few-shot story: a handful of examples usually isn’t enough.
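
For intuition, a many-shot prompt is just the few-shot template scaled up; the format below is illustrative:

```python
def many_shot_prompt(examples, query, k=100):
    """Scale the usual few-shot template up to ~100 in-context examples,
    the regime where the paper reports steady gains. Format illustrative."""
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples[:k])
    return f"{shots}\n\nInput: {query}\nLabel:"
```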
Sep 13
RL done right is no joke!

The most interesting AI paper I read this week.

It trains a minimal single-agent model that reaches top results on deep research tasks.

Great example of simple RL-optimized single agents beating complex multi-agent scaffolds.

Now let's break it down:
One agent, minimal tools

The agent only gets search, static browsing (no link clicking), and Python. This makes training hard enough that the model has to learn strategy, not just rely on shortcuts.

Instead of relying on complex multi-agent setups, they train one model end-to-end with RL on synthetic tasks.
Clever scaffolding

Multi-turn tool calls are collapsed into a single, growing context.

This stabilizes reasoning and avoids messy, runaway conversations. A clean_memory tool lets the model compress its own context when it gets too long.
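
Roughly, with hypothetical model and tool interfaces:

```python
def rollout(model, task, tools, max_steps=50):
    """Toy version of the scaffold: every tool result is appended to one
    growing transcript instead of a multi-turn chat, and the model may
    call clean_memory to compress it. model/tools are hypothetical."""
    context = task
    for _ in range(max_steps):
        step = model.generate(context)            # reason + emit one tool call
        if step.tool == "finish":
            return step.answer
        if step.tool == "clean_memory":
            context = model.summarize(context)    # compress its own context
            continue
        result = tools[step.tool](**step.args)    # search / browse / python
        context += f"\n{step.thought}\nObservation: {result}"
```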
Sep 9
Another impressive paper by Meta.

It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.

REFRAG achieves up to 30.85× TTFT acceleration.

Let's break down the technical details:
TL;DR

REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.

This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea

Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.

A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
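
A sketch of the compress-then-selectively-expand step; dimensions, names, and the precomputed mask are assumptions, not Meta's implementation:

```python
import torch
import torch.nn as nn

class ChunkCompressor(nn.Module):
    """Toy REFRAG-style compressor: encode each retrieved chunk to one
    vector, project into the decoder's embedding space, and splice those
    vectors in where the chunk's tokens would have gone. The RL selection
    policy is stubbed as a boolean mask. Illustrative only."""
    def __init__(self, enc_dim=384, dec_dim=4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, chunk_encodings, expand_mask, token_embeds):
        # chunk_encodings: [num_chunks, enc_dim] from a light encoder
        # expand_mask:     [num_chunks] bool, True = keep full tokens
        # token_embeds:    list of [chunk_len, dec_dim] full-token embeddings
        rows = []
        for i, enc in enumerate(chunk_encodings):
            if expand_mask[i]:
                rows.append(token_embeds[i])              # zoomed-in chunk
            else:
                rows.append(self.proj(enc).unsqueeze(0))  # 1 vector per chunk
        # Prepend the result to the query's token embeddings at decode time
        return torch.cat(rows, dim=0)
```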
Sep 9
Emergent Hierarchical Reasoning in LLMs

The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.

First, the model firms up low-level execution; then progress hinges on exploring high-level planning.

More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.

They also propose semantic entropy as a better exploration signal than token-level entropy.
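
The credit boost, rendered as a toy advantage reweighting; alpha and the planning-token tagging are illustrative, not the paper's exact formulation:

```python
def hicra_advantages(advantages, planning_mask, alpha=0.5):
    """Toy rendering of the HICRA idea: amplify the policy-gradient
    advantage on tokens tagged as high-level planning, leave execution
    tokens untouched. How tokens get tagged is assumed, not specified here."""
    return [a * (1 + alpha) if is_plan else a
            for a, is_plan in zip(advantages, planning_mask)]
```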
Two-phase dynamic

Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.

Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
Sep 8
I'm surprised Agentic RAG is not getting more attention.

That's all about to change.

Here's why:
Standard RAG systems can only do so much and are quite limited in how much value you can pack into the AI response.

Configuring LLMs to leverage tools via an agent allows you to prepare responses that not only ground answers better but also reduce hallucinations across the board.
Tools provide the agentic RAG system with additional, relevant context when it needs it.

Simple queries can be answered by the vector store retriever component, but more complex queries can be answered more precisely with multiple retriever components that are themselves subagents.
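
A minimal sketch of that routing pattern; every name here is illustrative:

```python
def answer(query, classify, vector_retriever, subagents, llm):
    """Toy agentic-RAG router: simple lookups go straight to the vector
    store; complex queries fan out to specialized retriever subagents.
    All interfaces are hypothetical."""
    if classify(query) == "simple":
        docs = vector_retriever(query)
    else:
        # Each subagent wraps its own retriever (web, SQL, code search, ...)
        docs = [doc for agent in subagents for doc in agent.retrieve(query)]
    return llm(f"Answer using only this context:\n{docs}\n\nQ: {query}")
```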