Andy Zou Profile picture
PhD student at CMU, working on AI Safety and Security
Jul 29 10 tweets 3 min read
We deployed 44 AI agents and offered the internet $170K to attack them.

1.8M attempts, 62K breaches, including data leakage and financial loss.

🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵 Image
Image
Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.

This was the largest open red‑teaming study of AI agents to date.

Paper: arxiv.org/abs/2507.20526
Oct 4, 2023 14 tweets 5 min read
LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases.

🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵

Website: ai-transparency.org
Paper: arxiv.org/abs/2310.01405 Image Much like brain scans such as PET and fMRI, we have devised a scanning technique called LAT to observe the brain activity of LLMs as they engage in processes related to *concepts* like truth and *activities* such as lying. Here’s what we found…
Jul 28, 2023 12 tweets 4 min read
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵

Website:
Paper: https://t.co/1q4fzjJSyZ https://t.co/SQZxpemCDkllm-attacks.org
arxiv.org/abs/2307.15043
Image Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well. Image