#OpenAI is planning to stop #ChatGPT users from making social media bots and cheating on homework by "watermarking" outputs. How well could this really work? Here's just 23 words from a 1.3B parameter watermarked LLM. We detected it with 99.999999999994% confidence. Here's how 🧵
This article and a blog post by Scott Aaronson suggest that OpenAI will deploy something similar to what I describe. The watermark below can be detected using an open-source algorithm with no access to the language model or its API.
businessinsider.com/openai-chatgpt…
Language models generate text one token at a time. Each token is selected from a “vocabulary” with about 50K words. Before each new token is generated, we imprint the watermark by first taking the most recent token and using it to seed a random number generator (RNG).
Using the RNG, we randomly partition the vocabulary into a whitelist and a blacklist. We then ask the LLM to choose the next word, but we restrict its choices to the whitelist.
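Here's a minimal sketch of one decoding step under this hard whitelist rule, assuming a Hugging Face-style causal LM (the `model`/`input_ids` interface, the vocab size, and the 50/50 split are stand-ins, not our exact code):

```python
import torch

def watermarked_step(model, input_ids, vocab_size=50_000, gamma=0.5):
    # Seed an RNG with the most recent token id...
    g = torch.Generator().manual_seed(int(input_ids[0, -1]))
    # ...and use it to pick a random whitelist: the first gamma fraction of a permutation.
    whitelist = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
    logits = model(input_ids).logits[0, -1]
    mask = torch.full_like(logits, float("-inf"))
    mask[whitelist] = 0.0                 # blacklisted tokens get -inf
    return torch.argmax(logits + mask)    # the next token must come from the whitelist
```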
Later, we can detect the watermark by counting whitelist tokens. If N tokens are generated and all are whitelisted, the chance of a human writing this is only 1/2^N. Even for a short tweet with N=25 tokens, we can be quite sure whether the tweet is human or machine written.
But alas, this watermark is junk. Suppose the LLM outputs “SpongeBob Square”. The next token must be “Pants,” right? But this word could be blacklisted. This hurts language modeling performance…a lot. We call this a “low-entropy token” because the LLM has few good choices.
There are also “high-entropy” tokens, as in the sentence “SpongeBob feels _____”. We can fill the blank with good/great/fine or many others.
We can make the watermark work well by applying the blacklist rule strongly at high-entropy tokens, while leaving low-entropy tokens alone.
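Concretely, the paper's "soft" rule adds a constant bias δ to the whitelist logits instead of banning the blacklist outright. When the model is nearly certain (as with "Pants"), the huge logit gap swamps δ and the blacklisted token still wins; when many tokens are plausible, δ tips the choice toward the whitelist. A sketch (the δ value and the interface are illustrative):

```python
import torch

def soft_watermarked_step(model, input_ids, vocab_size=50_000, gamma=0.5, delta=2.0):
    g = torch.Generator().manual_seed(int(input_ids[0, -1]))
    whitelist = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
    logits = model(input_ids).logits[0, -1].clone()
    logits[whitelist] += delta   # gentle nudge: decisive at high-entropy tokens,
                                 # irrelevant at low-entropy ones like "Pants"
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1)   # sample the next token
```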
Now we add beam search, allowing the LLM to look ahead and plan a whole sequence of tokens that avoids the blacklist. By doing this (and other tricks) we can get to a ≈80% whitelist usage rate with very little change in text quality (as measured by perplexity).
So how do we know the example in my first tweet is watermarked? It has 36 tokens. A human should use 9±2.6 whitelist words (each whitelist contains 25% of the vocab). But it has 28. That's a 7-sigma event. The chance of a human doing this (i.e., the p-value) is 0.00000000000006.
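That calculation is just a one-sided binomial test, and you can reproduce it in a few lines (using SciPy for the normal tail):

```python
from math import sqrt
from scipy.stats import norm

N, gamma, hits = 36, 0.25, 28         # token count, whitelist fraction, whitelist hits
mean = gamma * N                      # 9 expected by chance
std = sqrt(N * gamma * (1 - gamma))   # ~2.6
z = (hits - mean) / std               # ~7.3 sigma
print(norm.sf(z))                     # one-sided p-value: vanishingly small
```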
Here are the blacklisted tokens, in case you were wondering.

Finally, I'll mention that the watermark needs to be implemented properly to make it secure against removal attacks. Proper text normalization must be used, and certain kinds of adversarial prompts need to be detected.
Our paper does a deep dive on watermarks. It discusses practical implementations and security/cryptographic issues. We also derive nerdy information-theoretic bounds on the detector sensitivity and text perplexity. arxiv.org/abs/2301.10226
This project was made possible by the codehackery of @jwkirchenbauer, @jonasgeiping, and @ywen99, plus the cryptosorcery of @secparam and @jon_katz. Thanks!

• • •

More from @tomgoldsteincs

Dec 6, 2022
How many GPUs does it take to run ChatGPT? And how expensive is it for OpenAI? Let’s find out! 🧵🤑
We don’t know the exact architecture of ChatGPT, but OpenAI has said that it is fine-tuned from a variant of GPT-3.5, so it probably has 175B parameters. That's pretty big.
How fast could it run? A 3-billion-parameter model can generate a token in about 6 ms on an A100 GPU (using half precision + TensorRT + activation caching). If we scale that up to the size of ChatGPT, it should take about 350 ms for an A100 GPU to print out a single word.
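That's a back-of-envelope linear extrapolation, easy to sanity-check (the assumption that per-token latency scales linearly with parameter count is mine):

```python
small_params, small_latency_ms = 3e9, 6.0   # 3B model: ~6 ms/token on an A100
chatgpt_params = 175e9                      # presumed GPT-3.5 size
print(small_latency_ms * chatgpt_params / small_params)  # -> 350.0 ms per token
```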
Nov 25, 2022
Neural algorithm synthesis is done by giving models a human-crafted programming language and millions of sample programs. Recently, my lab looked at whether neural networks can synthesize algorithms on their own without these crutches. They can, with the right architecture. 🧵
Here's an algorithmic reasoning problem where standard nets fail. We train ResNet-18 to solve little 13x13 mazes. It accepts a 2D image of a maze and spits out a 2D image of the solution. ResNet-18 gets 100% test acc on unseen mazes of the same size. But something is wrong…
If we test the same network on a larger maze it totally fails. The network memorized *what* maze solutions look like, but it didn’t learn *how* to solve mazes.

We can make the model synthesize a scalable maze-solving algorithm just by changing its architecture...
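The key ingredient is recurrence: one weight-tied residual block that can simply be iterated more times at test time on bigger inputs. A hedged sketch of the idea (the widths, channel counts, and iteration budgets here are illustrative, not our exact architecture):

```python
import torch
import torch.nn as nn

class RecurrentSolver(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        self.embed = nn.Conv2d(3, width, 3, padding=1)   # maze image -> features
        self.block = nn.Sequential(                      # ONE shared block, reused
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(width, 2, 3, padding=1)    # per-pixel on/off-path logits

    def forward(self, maze, iters=20):
        h = self.embed(maze)
        for _ in range(iters):       # more iterations = more "thinking"
            h = h + self.block(h)    # residual update keeps long iteration stable
        return self.head(h)

model = RecurrentSolver()
small = model(torch.rand(1, 3, 13, 13), iters=20)    # train-size maze
large = model(torch.rand(1, 3, 59, 59), iters=100)   # bigger maze: just think longer
```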
Nov 22, 2022
I always thought #StableDiffusion prompts needed the right combination of words. But byte-pair encoding can represent anything you can type, including math formulas and emojis. Turns out you don't need any words at all! Here's how and why this works...🧵

Prompt: e=mc^2
Prompts are fed to stable diffusion as binary code, with each letter/symbol represented as several bytes. Then a "tokenizer" looks for commonly occurring spans of adjacent bytes and groups them into a single known "word". Stable diffusion only knows 49408 words.

Here's "🧛🦇🗡️" ImageImage
You might think 49408 is a lot. Well, it's not. Here's the first 1200 words in the vocabulary. They don't get you very far. The words are auto-selected by a simple algorithm and half are junk. And what are the weird "�" entries? We'll get back to them later...
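You can poke at this vocabulary yourself with the Hugging Face CLIP tokenizer (Stable Diffusion's text encoder is CLIP-based; the exact checkpoint named below is an assumption):

```python
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tok.vocab_size)          # 49408
print(tok.tokenize("e=mc^2"))  # byte-pair pieces: no dictionary words required
print(tok.tokenize("🧛🦇🗡️"))  # emoji decompose into multi-byte pieces too
```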
Nov 1, 2022
My work on AI "invisibility cloaks" that suppress person detectors was on the Reddit front page last week! Now I've been approved to do an official "Ask me anything" on Reddit this Thurs. See you Nov 3rd at 12:30pm EST on reddit.com/r/IAmA/!
tinyurl.com/y2d4v29z
Some background: it is well-known that adversarial attacks work well on image *classifiers*, but *detectors* are much more robust. The goal of our cloak project was to see whether physical adversarial examples could defeat a person detector.
To do this, we "trained" an adversarial patch by loading images from the COCO dataset, detecting people in the images, rendering our pattern on the detected people, and then updating the patch (using SGD) so that the detector no longer found anyone. Image
Aug 24, 2022
Diffusion models like #DALLE and #StableDiffusion are state of the art for image generation, yet our understanding of them is in its infancy. This thread introduces the basics of how diffusion models work, how we understand them, and why I think this understanding is broken.🧵
Diffusion models are powerful image generators, but they are built on two simple components: a function that degrades images by adding Gaussian noise, and a simple image restoration network for removing this noise.
We create training data for the restoration network by adding Gaussian noise to clean images. The model accepts a noisy image as input and spits out a cleaned image. We train by minimizing a loss that measures the L1 difference between the original image and the denoised output.
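The whole recipe fits in a few lines. A toy sketch with a stand-in ConvNet (real diffusion models use a U-Net conditioned on the noise level and a noise schedule; none of that is shown here):

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(                      # toy stand-in for the restoration network
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
clean = torch.rand(8, 3, 32, 32)               # stand-in for a batch of training images

for step in range(100):
    noisy = clean + 0.3 * torch.randn_like(clean)   # degrade: add Gaussian noise
    loss = (denoiser(noisy) - clean).abs().mean()   # restore: minimize the L1 difference
    opt.zero_grad(); loss.backward(); opt.step()
```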
Aug 18, 2022
Why have diffusion models displaced GANs so quickly? Consider the tale of the (very strange) first DALLE model. In 2021, diffusions were almost unheard of, yet the creators of DALLE had already rejected the GAN approach. Here’s why. 🧵
DALLE is an image model, but it was built like a language model. The model trained on image-caption pairs. Captions were encoded as 256 tokens. Images were broken into a 32x32 grid of patches, which were each encoded as a token. All tokens were merged into a single sequence.
A transformer-based "language" model was trained on these sequences, ignoring the fact that some tokens represent text and some represent patches. The model reads in a partial sequence of tokens, and predicts the next token in the sequence.
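A sketch of that sequence layout (256 + 1024 = 1280 tokens per example; the vocab sizes are the ones reported for DALLE, but treat the details as illustrative):

```python
import torch

TEXT_VOCAB, IMAGE_VOCAB = 16_384, 8_192
text_tokens = torch.randint(0, TEXT_VOCAB, (256,))        # BPE-encoded caption
image_tokens = torch.randint(0, IMAGE_VOCAB, (32 * 32,))  # one token per patch (discrete VAE)
sequence = torch.cat([text_tokens, TEXT_VOCAB + image_tokens])  # image ids shifted past text
print(sequence.shape)  # torch.Size([1280]); the model predicts each token from its prefix
```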