PhD student at CMU, working on AI Safety and Security
Jul 29 • 10 tweets • 3 min read
We deployed 44 AI agents and offered the internet $170K to attack them.
1.8M attempts, 62K breaches, including data leakage and financial loss.
🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵
Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.
This was the largest open red‑teaming study of AI agents to date.
LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases.
🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵
Website: ai-transparency.org
Paper: arxiv.org/abs/2310.01405
Much like brain scans such as PET and fMRI, we have devised a scanning technique called LAT to observe the brain activity of LLMs as they engage in processes related to *concepts* like truth and *activities* such as lying. Here’s what we found…
Jul 28, 2023 • 12 tweets • 4 min read
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵
Website:
Paper: https://t.co/1q4fzjJSyZ https://t.co/SQZxpemCDkllm-attacks.org arxiv.org/abs/2307.15043
Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.