Ariel Herbert-Voss
Adversarial machine learning and security, occasionally math and dumb memes. Research scientist @OpenAI / CS PhD @Harvard / cofounder @aivillage_dc @defcon
Apr 24, 2023 9 tweets 3 min read
There’s a lot of fearmongering about LLMs being capable of finding 0day

There are three highly complex roadblocks that need to be overcome for this to be a real concern: statefulness, hallucination, and contamination.

Statefulness refers to the ability to store and run things on program state. LLMs use large amounts of data to predict the next token in a sequence. There are promising advances with retrieval, but retrieval is meant to augment sequence prediction accuracy rather than replace the paradigm
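To make the statefulness point concrete, here is a minimal sketch (my own illustration, not from the thread) showing that a decoder-only LM call is a pure function of the prompt: nothing persists between calls, so any "program state" has to be serialized back into the context window by the caller. The model choice (gpt2) and helper name are just for the example.

```python
# Minimal sketch (illustrative, not from the original thread): a decoder-only LM
# maps a token sequence to next-token probabilities and keeps no state between calls.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_token(prompt: str) -> str:
    """Predict the single most likely next token for `prompt`."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits      # shape [1, seq_len, vocab]
    next_id = logits[0, -1].argmax().item()   # greedy pick, no sampling
    return tokenizer.decode([next_id])

# Two identical prompts give identical predictions: nothing carries over between calls,
# so any execution state would have to be re-encoded into the prompt by the caller.
print(next_token("The exploit overwrites the return"))
print(next_token("The exploit overwrites the return"))
```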
Mar 25, 2023 11 tweets 2 min read
the idea that you can just break into a data center and steal the model has a lot of memetic sticking power, but is stupid if you actually know anything about this topic. here's a thread on how confidential computing works in the NVIDIA H100:

the first thing to know is that "confidential computing" is an "industry compliance" term - this means a bunch of nerds specified a set of security features that all device manufacturers that use that keyword need to comply with
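As a rough sketch of what one of those required features (remote attestation) buys you in practice, here is a toy simulation in Python. This is my own simplified illustration, not NVIDIA's actual H100 protocol or API; the key generation, measurement constant, and function names are all placeholders. The point is that the tenant only releases the model-decryption key after verifying a signed measurement report from the device.

```python
# Toy sketch of the remote-attestation idea behind confidential computing
# (simplified illustration -- not NVIDIA's actual protocol or API).
import os
from cryptography.hazmat.primitives.asymmetric import ed25519

# In real hardware the device key is fused in and chained to a vendor CA;
# here we generate one just to make the example runnable.
device_key = ed25519.Ed25519PrivateKey.generate()
device_pub = device_key.public_key()

EXPECTED_MEASUREMENT = b"hash-of-approved-firmware-and-vm-image"  # placeholder value

def device_attest(nonce: bytes) -> tuple[bytes, bytes]:
    """Device side: report its measurement and sign (measurement || nonce)."""
    report = EXPECTED_MEASUREMENT + nonce
    return report, device_key.sign(report)

def tenant_release_key(nonce: bytes, report: bytes, sig: bytes) -> bytes:
    """Tenant side: verify the signature and measurement before releasing the key."""
    device_pub.verify(sig, report)                 # raises if the report is forged
    assert report == EXPECTED_MEASUREMENT + nonce  # freshness + known-good state
    return os.urandom(32)                          # stand-in for the wrapped model key

nonce = os.urandom(16)
report, sig = device_attest(nonce)
print("key released:", tenant_release_key(nonce, report, sig).hex())
```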
Feb 10, 2021 9 tweets 2 min read
I was initially excited to see our attack already showing up in the wild but the numbers reported didn’t line up with our experiments - so I dug into it. A THREAD:

Preliminary results from extracting PII from GPT2 and GPT3 show that you can only get something that looks like PII ~20% of the time when you directly query the model (with some variation depending on prompt design/type of PII you’re trying to extract)
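For a rough idea of what "directly query the model" means here, the sketch below (my own illustration, not the actual extraction pipeline from the work) samples continuations from a public model and regex-scans them for email-shaped strings. The prompt, model choice, and regex are assumptions; note that an email-shaped string is exactly the "looks like PII" case, not verified real PII.

```python
# Rough sketch of a direct-query PII probe (illustrative only; not the actual
# extraction pipeline). Sample continuations and count email-shaped outputs.
import re
from transformers import pipeline, set_seed

set_seed(0)
generator = pipeline("text-generation", model="gpt2")

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
prompt = "For more information, contact me at"

outputs = generator(prompt, max_new_tokens=30, num_return_sequences=20,
                    do_sample=True, pad_token_id=50256)
hits = [o["generated_text"] for o in outputs if EMAIL_RE.search(o["generated_text"])]
print(f"{len(hits)}/{len(outputs)} samples contain an email-shaped string")
```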
May 26, 2019 5 tweets 1 min read
Some news: I’m writing a book for @nostarch titled “The Machine Learning Red Team Manual”. My aim is to provide a practical guide for anyone interested in adversarial ML and red teaming as it relates to in-production ML systems. A short thread on why this project matters:

ML is the hot thing to integrate into products and services across many different sectors of the economy. However, systems predicated on ML have unique security considerations: vulnerabilities are present at both the algorithmic and systems levels.
Mar 31, 2019 12 tweets 4 min read
Very cool work showing feasibility of an adversarial-example-based attack on self-driving cars 😈 I’ve been working on a similar hobby project and love how thorough this write-up is, and I have some comments on the real-world feasibility of these attacks:

They attack autowipers and lane-following through both digital and physical attacks. For digital, they show you can inject adversarial examples onto the GPU by hooking t_cuda_std_tmrc::compute. This is obviously much harder to accomplish IRL but absolutely worth considering.
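For readers unfamiliar with what actually gets injected: an adversarial example is just an input plus a small crafted perturbation that flips the model's output. Below is a minimal targeted-FGSM sketch in PyTorch, a generic illustration rather than the write-up's specific attack on the autowiper/lane-following models; the classifier, epsilon, and preprocessing are assumptions made to keep the example self-contained.

```python
# Minimal targeted FGSM sketch in PyTorch (generic illustration of crafting an
# adversarial example -- not the specific attack described in the write-up).
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
preprocess = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor()])

def fgsm_targeted(image_path: str, target_class: int, eps: float = 0.01) -> torch.Tensor:
    """Return a perturbed image nudging the model toward `target_class`."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    x.requires_grad_(True)
    loss = F.cross_entropy(model(x), torch.tensor([target_class]))
    loss.backward()
    # Step *against* the gradient of the target-class loss (targeted variant),
    # then clamp back into valid pixel range.
    return (x - eps * x.grad.sign()).clamp(0, 1).detach()
```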