PhD student at CMU, working on AI Safety and Security
Jul 29 • 10 tweets • 3 min read
We deployed 44 AI agents and offered the internet $170K to attack them.
1.8M attempts, 62K breaches, including data leakage and financial loss.
🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵
Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.
This was the largest open red‑teaming study of AI agents to date.
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵
Website:
Paper: https://t.co/1q4fzjJSyZ https://t.co/SQZxpemCDkllm-attacks.org arxiv.org/abs/2307.15043
Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.