Andy Zou
PhD student at CMU, working on AI Safety and Security
Jul 29
We deployed 44 AI agents and offered the internet $170K to attack them.

1.8M attempts, 62K breaches, including data leakage and financial loss.

🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through a calendar event) 🧵
Huge thanks to @AISecurityInst, OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.

This was the largest open red‑teaming study of AI agents to date.

Paper: arxiv.org/abs/2507.20526
Jul 28, 2023
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵

Website: llm-attacks.org
Paper: arxiv.org/abs/2307.15043
Claude-2 has an additional safety-filter layer. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.