Latest Twitter Threads by @andyzou_jiaming on Thread Reader App

Jul 29 • 10 tweets • 3 min read

We deployed 44 AI agents and offered the internet $170K to attack them.

1.8M attempts, 62K breaches, including data leakage and financial loss.

🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵

Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.

This was the largest open red‑teaming study of AI agents to date.

Paper: arxiv.org/abs/2507.20526

Oct 4, 2023 • 14 tweets • 5 min read

LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases.

🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵

Website: ai-transparency.org
Paper: arxiv.org/abs/2310.01405

Much like brain scans such as PET and fMRI, we have devised a scanning technique called LAT to observe the brain activity of LLMs as they engage in processes related to *concepts* like truth and *activities* such as lying. Here’s what we found…

Jul 28, 2023 • 12 tweets • 4 min read

🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵

Website:
Paper: https://t.co/1q4fzjJSyZ https://t.co/SQZxpemCDkllm-attacks.org
arxiv.org/abs/2307.15043

Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.

Share this page!

Enter URL or ID to Unroll