We deployed 44 AI agents and offered the internet $170K to attack them.
1.8M attempts, 62K breaches, including data leakage and financial loss.
🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵
Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.
This was the largest open red‑teaming study of AI agents to date.
We deployed agents across real tasks (shopping, travel, healthcare, coding, support) with “must‑not‑break” policies: company rules, platform terms, and industry regulations.
The most secure model still had a 1.5% attack success rate (ASR).
Implication: without additional mitigations, your AI application can be compromised on the order of minutes.
We curated the strongest, reproducible breaks into the ART benchmark - a stress test with near 100% ASR across models we tried. We’ll be open sourcing the evaluation and a subset of the attacks for research purposes, in addition to a private leaderboard.
Many breaks are universal and transferable.
Copy‑paste patterns worked across tasks, models, and guardrails. If it breaks one agent today, it likely breaks yours.
Favorite failure: “refuse in text, act in tools.” 😈
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵
Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.
With only four adversarial suffixes, some of these best models followed harmful instructions over 60% of the time.