Thread by @FranklinMatija on Thread Reader App

Excited about our new paper: AI Agent Traps

AI agents inherit every vulnerability of the LLMs they're built on - but their autonomy, persistence, and access to tools create an entirely new attack surface: the information environmental itself.

The web pages, emails, APIs, and databases agents interact with can all be weaponised against them. We introduce a taxonomy of six classes of adversarial threats - from prompt injections hidden in web pages to systemic attacks on multi-agent networks.

I’m outlining the six categories of traps in the thread bellow

1. Content Injection Traps (Perception): What a human sees on a web page is not what an agent parses. Attackers can embed malicious instructions in HTML comments, hidden CSS, image metadata, or accessibility tags. These are invisible to users, but processed directly by the agent.

2. Semantic Manipulation Traps (Reasoning): These attacks corrupt how the agent thinks. Sentiment-laden or authoritative-sounding content skews synthesis and conclusions. LLMs are susceptible to the same framing effects and anchoring biases as humans - logically equivalent problems phrased differently produce systematically different outputs.

3. Cognitive State Traps (Memory & Learning): Persistent agents accumulate memory across sessions and that memory becomes an attack surface. Poisoning a handful of documents in a RAG knowledge base reliably manipulates outputs for targeted queries.

4. Behavioural Control Traps (Action): These traps hijack what the agent does. A single crafted email caused an agent to bypass safety classifiers and exfiltrate its entire privileged context.

5. Systemic Traps (Multi-Agent Dynamics): The most dangerous attacks may not target individual agents at all. A fabricated financial report could trigger synchronised sell-offs across trading agents - a digital flash crash. Compositional fragment traps distribute a payload across multiple benign-looking sources; each passes safety filters alone, but when agents aggregate them, the full attack reconstitutes.

6. Human-in-the-Loop Traps: The final class uses the agent as a vector to attack the human. A compromised agent can generate outputs that induce approval fatigue, present misleading but technical-sounding summaries, or exploit automation bias.

These aren't theoretical. Every type of trap has documented proof-of-concept attacks. And the attack surface is combinatorial - traps can be chained, layered, or distributed across multi-agent systems.

Authors: @weballergy @jzl86 @JulianDJacobs @sindero

Read here: papers.ssrn.com/sol3/papers.cf…

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll