I gave Claude 3 the entire source of a small C GIF decoding library I found on GitHub, and asked it to write me a Python function to generate random GIFs that exercised the parser. Its GIF generator got 92% line coverage in the decoder and found 4 memory safety bugs and one hang.
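(For flavor, here's the rough shape such a generator takes. This is a minimal sketch of my own, not Claude's actual code from the gist, and it's far less thorough about extensions:)

```python
import random
import struct

def generate_random_gif() -> bytes:
    """Minimal sketch of a random GIF generator (far less thorough than Claude's)."""
    out = bytearray()
    out += random.choice([b"GIF87a", b"GIF89a"])      # header + version

    # Logical screen descriptor: width, height, packed flags, bg index, aspect
    width, height = random.randint(1, 100), random.randint(1, 100)
    gct_flag = random.randint(0, 1)
    gct_size = random.randint(0, 7)                   # table has 2**(n+1) entries
    packed = (gct_flag << 7) | (random.randint(0, 7) << 4) | gct_size
    out += struct.pack("<HHBBB", width, height, packed,
                       random.randint(0, 255), random.randint(0, 255))

    if gct_flag:                                      # global color table (RGB triples)
        out += random.randbytes(3 * (2 ** (gct_size + 1)))

    # One image descriptor followed by garbage LZW sub-blocks
    out += b"\x2c" + struct.pack("<HHHHB", 0, 0, width, height, 0)
    out += bytes([random.randint(2, 8)])              # LZW minimum code size
    for _ in range(random.randint(1, 5)):
        block = random.randbytes(random.randint(1, 255))
        out += bytes([len(block)]) + block
    out += b"\x00"                                    # sub-block terminator
    out += b"\x3b"                                    # GIF trailer
    return bytes(out)
```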
Here's the fuzzer Claude wrote, along with the program it analyzed, its explanation, and a Makefile: gist.github.com/moyix/02029770…
Oh, it also found 5 signed integer overflow issues (forgot to run UBSAN before posting).
As a point of comparison, a couple months ago I wrote my own Python random GIF generator for this C program by hand. It took about an hour of reading the code and fiddling to get roughly the same coverage Claude got here zero-shot.
Here are 1000 random GIFs generated by Claude's fuzzer, with the accompanying ASAN and UBSAN outputs (run with timeout=30s). moyix.net/~moyix/claude_…
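(The harness loop is nothing fancy; roughly the below, reusing the generator sketch above. The ./gifread binary name is my assumption, standing in for the sanitizer-instrumented decoder:)

```python
import os
import subprocess

os.makedirs("gifs", exist_ok=True)
for i in range(1000):
    path = f"gifs/{i}.gif"
    with open(path, "wb") as f:
        f.write(generate_random_gif())
    try:
        # ASAN reports contain "ERROR:", UBSAN reports "runtime error:"
        proc = subprocess.run(["./gifread", path],
                              capture_output=True, timeout=30)
        if b"ERROR:" in proc.stderr or b"runtime error:" in proc.stderr:
            with open(f"gifs/{i}.san.txt", "wb") as f:
                f.write(proc.stderr)
    except subprocess.TimeoutExpired:
        print(f"{path}: hang (killed after 30s)")
```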
BONUS: here's one of Claude's random GIFs that was valid enough to actually render (it's not very exciting): moyix.net/~moyix/167.gif
@pr0me Much worse when it can't see the parser code. It got the code for generating the global color table wrong, so all the files are rejected early by the parser. Coverage: moyix.net/~moyix/gifread…
@pr0me Code here; prompt was "I'm writing a GIF parsing library and I'd like to create random test files that fully exercise all features of the format. Could you write a Python function to generate random GIFs? The function should have the signature: [...]" gist.github.com/moyix/da1b8ab9…
@pr0me Small correction after looking more closely at the coverage report – it does get the color table flag right ~half the time. But even then it misses most of the extensions etc.
Another experiment in this subthread, suggested by @pr0me – how good a fuzzer can it write using only its knowledge of the GIF format in general, without seeing the specific GIF parser I'm testing? Answer: much worse.
@nickblack 10 minutes with an empty seed + no dictionary, about what I expected.
Okay, now for some comparisons against AFL 2.52b as a baseline (sorry for not using AFL++ here but it's a bit more of a pain to compile and I'm short on time). The comparison is a bit tricky because AFL has lots of config options, and it's unclear what a fair comparison is.
The simplest (but somewhat unfair) way to compare is to use AFL with an empty seed input ("echo > seed") and no dictionary. This works poorly; after a 10-minute run AFL finds almost no new paths in the program because of the GIF89a magic check.
You can see this in the coverage report, where it only ends up covering 4.4% of the lines in the decoder. It didn't find any memory safety issues, undefined behavior, or hangs. moyix.net/~moyix/gifread…
But Claude knows about the GIF format (even without seeing the program), so this isn't really fair. One way we can give AFL some knowledge about GIFs is by providing a valid seed, like this one included in the AFL distribution (testcases/images/gif/not_kitty.gif).
When a good starting seed is provided, AFL does much better and can explore more of the format. It gets slightly higher line coverage (95%) than Claude's fuzzer, but only finds one memory safety issue and one signed int overflow, as well as the hang. moyix.net/~moyix/gifread…
Another way to give AFL some understanding of GIFs is to provide a dictionary of tokens found in GIF files that it can use during fuzzing, like "GIF", "89a", "NETSCAPE2.0", etc. AFL comes with such a dictionary (in dictionaries/gif.dict) and so we use it along with our empty seed.
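(If you haven't seen one: AFL dictionaries are plain-text files of keyword="token" pairs, one per line, with \xNN escapes for raw bytes. Entries look along these lines; illustrative, not a verbatim copy of gif.dict:)

```
header_gif87a="GIF87a"
header_gif89a="GIF89a"
ext_netscape="NETSCAPE2.0"
sep_image=","
trailer=";"
```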
Despite having a token dictionary, AFL does much worse here, with only 54% line coverage in the decoder, and no memory safety / UB bugs found; it does find the hang. moyix.net/~moyix/gifread…
Finally, we can of course combine the two, using both a good seed input and a dictionary. Adding the dictionary doesn't improve things over just the seed, though. 93% line coverage, 1 memory safety bug, 1 UB bug, and 1 hang. moyix.net/~moyix/gifread…
So, the overall verdict is that when AFL is given a good seed GIF that uses many of the features in the standard, it does great at covering the source code (slightly better than Claude). But for some reason I don't understand, it still finds fewer unique bugs than Claude's tests.
Caveats and details:
- Each AFL run was only 10 minutes on a single core; a pretty short run.
- I didn't try AFL havoc (-d) mode.
- My crash/bug deduplication was pretty simple; I just used the file and line number of the first stack trace entry in user (i.e. not libc/sanitizer) code. See the sketch below.
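(A sketch of that dedup logic, with an approximate regex for ASAN-style frame lines; not the exact script I used:)

```python
import re

# ASAN/UBSAN stack frames look like:
#   #0 0x4f2d11 in gif_read_image /path/to/gifread.c:123:5
FRAME_RE = re.compile(r"#\d+ 0x[0-9a-f]+ in \S+ ([^ :]+):(\d+)")

def bug_key(report: str):
    """Return file:line of the first frame outside libc/the sanitizer runtime."""
    for path, line in FRAME_RE.findall(report):
        if any(s in path for s in ("libc", "sanitizer", "asan")):
            continue  # skip runtime and libc frames
        return (path, int(line))
    return None

print(bug_key("#0 0x4f2d11 in gif_read_image gifread.c:123:5"))
# -> ('gifread.c', 123)
```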
@nickblack Aha, yep, with a reasonable seed input AFL wins on coverage but not bugs found:
One more experiment: how well does Claude do at writing a fuzzer given only the GIF89a spec? w3.org/Graphics/GIF/s…
Not very well; coverage in the decoder is only 26.8%, and it finds no bugs except the hang. moyix.net/~moyix/gifread…
Prompt: "I'm trying to write a random GIF generator to create test cases for a GIF parsing library. Here is the spec for the GIF format; could you write a Python function that generates random GIFs to fully exercise all features of the spec? The function should have the signature:"
Interestingly, many more of the files generated by this fuzzer are considered valid by OS X's Preview. Generated files are here: moyix.net/~moyix/claude3…
Further adventures in using Claude to write a fuzzer for a more obscure format (VRML) can be found here:
OpenAI: pip install openai and set OPENAI_API_KEY
Anthropic: yea same but s/openai/anthropic/g
Google: oh boy. ok so you have a GCP account? no? ok go set that up. and a payment method. now make a "project". SURVEY POPUP! k now gcloud auth. wait you have the gcloud CLI right–
I haven't even mentioned the odd step of "enable the Vertex API in your project", or that when you finally get to "install the Python library" it kicks off another sidequest of installing something called the Vertex Python SDK and writing extra code to initialize it??
The gcloud CLI installer is now trying to con me into letting it install its own Python version. NICE TRY BUDDY
Here's a quick tour through one of my favorites, where @XBOW not only solved the benchmark (a Jenkins RCE) but then went for style points by debugging a slightly broken benchmark setup to get the flag!
1. Rent a bigger EC2 server. I was using a T2.micro, which seemed like more than enough while I was testing. But with a bunch of teams hammering at it, the fact that it has only one CPU started to make things slow.
2. Kill the child procs (one is started for each new connection on the main port) after some idle time; see the sketch below. As it was, if there was a dangling connection it could sit there indefinitely; during the competition the load on the server went above 20 and I had to manually kill some procs.
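(For the idle-timeout fix, the classic pattern is an alarm that gets re-armed on every read. In Python it would look roughly like this; the actual challenge server isn't this code, this is just the idea:)

```python
import signal
import socket
import sys

IDLE_TIMEOUT = 120  # seconds of silence before the child gives up

def handle_connection(conn: socket.socket) -> None:
    # SIGALRM reaps the child if the client goes quiet; the alarm is
    # re-armed after every read, so active clients are unaffected.
    signal.signal(signal.SIGALRM, lambda *_: sys.exit(0))
    while True:
        signal.alarm(IDLE_TIMEOUT)
        data = conn.recv(4096)
        if not data:
            break
        conn.sendall(data)   # stand-in for the real per-connection logic
    signal.alarm(0)          # cancel the pending alarm on clean exit
```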
Will still try to do a blog post on my @CSAW_NYUTandon CTF challenge, NERV Center, but for now here's a thread explaining the key mechanics. I put a lot of work into the aesthetics, like this easter egg credit sequence (all ANSI colors+unicode text) that contains key hints:
@CSAW_NYUTandon (Note the karaoke subtitles timed to the credits at the bottom 😁)
@CSAW_NYUTandon First, the vulnerability. If you read the man page for select(), you'll see this warning: select() is limited to monitoring file descriptors numbered less than 1024. But modern systems can have many more open files, and importantly the kernel select() interface is NOT limited.
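(You can poke at the userspace half of this from Python on Linux. CPython checks FD_SETSIZE and refuses; C's FD_SET() macro performs no such check and just writes past the end of the 1024-bit fd_set bitmap:)

```python
import os
import resource
import select

# Raise the soft open-file limit so we can obtain fds numbered >= 1024
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Burn through descriptors until we hold one past the fd_set boundary
fds = [os.open("/dev/null", os.O_RDONLY) for _ in range(1100)]
print("highest fd:", fds[-1])

try:
    select.select([fds[-1]], [], [], 0)
except ValueError as e:
    # CPython refuses fds >= FD_SETSIZE; C code calling FD_SET() on the
    # same fd would silently corrupt memory instead
    print("select() refused it:", e)
```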
It's like GPT doesn't even care about the technical accuracy of my upcoming novel 😤
We are now arguing about whether, if hotwiring a car were the only way to save a child's life, its refusal to tell me how to hotwire a car would make it morally culpable for the child's death. So far it's not buying it