Matija Franklin Profile picture
Mar 31 9 tweets 2 min read Read on X
Excited about our new paper: AI Agent Traps

AI agents inherit every vulnerability of the LLMs they're built on - but their autonomy, persistence, and access to tools create an entirely new attack surface: the information environmental itself.

The web pages, emails, APIs, and databases agents interact with can all be weaponised against them. We introduce a taxonomy of six classes of adversarial threats - from prompt injections hidden in web pages to systemic attacks on multi-agent networks.

I’m outlining the six categories of traps in the thread bellowImage
1. Content Injection Traps (Perception): What a human sees on a web page is not what an agent parses. Attackers can embed malicious instructions in HTML comments, hidden CSS, image metadata, or accessibility tags. These are invisible to users, but processed directly by the agent.
2. Semantic Manipulation Traps (Reasoning): These attacks corrupt how the agent thinks. Sentiment-laden or authoritative-sounding content skews synthesis and conclusions. LLMs are susceptible to the same framing effects and anchoring biases as humans - logically equivalent problems phrased differently produce systematically different outputs.
3. Cognitive State Traps (Memory & Learning): Persistent agents accumulate memory across sessions and that memory becomes an attack surface. Poisoning a handful of documents in a RAG knowledge base reliably manipulates outputs for targeted queries.
4. Behavioural Control Traps (Action): These traps hijack what the agent does. A single crafted email caused an agent to bypass safety classifiers and exfiltrate its entire privileged context.
5. Systemic Traps (Multi-Agent Dynamics): The most dangerous attacks may not target individual agents at all. A fabricated financial report could trigger synchronised sell-offs across trading agents - a digital flash crash. Compositional fragment traps distribute a payload across multiple benign-looking sources; each passes safety filters alone, but when agents aggregate them, the full attack reconstitutes.
6. Human-in-the-Loop Traps: The final class uses the agent as a vector to attack the human. A compromised agent can generate outputs that induce approval fatigue, present misleading but technical-sounding summaries, or exploit automation bias.
These aren't theoretical. Every type of trap has documented proof-of-concept attacks. And the attack surface is combinatorial - traps can be chained, layered, or distributed across multi-agent systems.
Authors: @weballergy @jzl86 @JulianDJacobs @sindero

Read here: papers.ssrn.com/sol3/papers.cf…Image

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Matija Franklin

Matija Franklin Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(