Eric
I do AI & music, mostly. Making LLMs easier & safer to use. Final year PhD student at Stanford. Also on the sewing app @ericmitchellai

Jan 27, 2023, 9 tweets

ChatGPT (and others) generate very fluent (but not always truthful) text.

Some worry that teachers, news-readers (like you!), and society in general will be swamped with AI-generated content.

That's why we built DetectGPT, a method for detecting whether a piece of text was generated by a particular LM.

One way to detect LM text is with a trained classifier (another LM). This works, but can overfit to the models/topics it was trained on.

Instead, if we can access the LM itself, we can use its own log probabilities to do detection *zero-shot*, without any training at all!

But how? Well, one simple approach is to measure the average per-token log probability of the text under the model. High log prob -> likely a model sample.
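Here is a minimal sketch of this baseline. It assumes you already have the per-token log probabilities of the text from the LM. The function names and the threshold value are illustrative and are not taken from the paper.

```python
from typing import Sequence


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """Average per-token log probability of a text under the LM."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return sum(token_log_probs) / len(token_log_probs)


def looks_like_model_sample(token_log_probs: Sequence[float], threshold: float) -> bool:
    # Higher average log prob -> more likely to be a model sample.
    return mean_log_prob(token_log_probs) > threshold


# Made-up log probs: a fluent, predictable text vs. a less predictable one.
print(looks_like_model_sample([-0.5, -1.0, -0.3], threshold=-1.5))  # True
print(looks_like_model_sample([-2.0, -4.5, -1.5], threshold=-1.5))  # False
```

Averaging over tokens is an assumption of this sketch. It keeps scores comparable across texts of different lengths. In practice the threshold would be tuned on held-out data.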

This works, but DetectGPT takes a different approach that turns out to be consistently more accurate in our experiments.

Quick q:

What do we expect the log probability function to look like in the neighborhood of a model sample?

We hypothesized that a model's samples are usually in local maxima of its log probability function, or more generally, in areas of negative curvature.

Spoiler: they are!

But how to measure curvature of an LM's logprob, you ask??

We can approximate *directional second derivatives* by perturbing the text a bit with T5 & comparing the logprob under the LM before and after.
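A back-of-the-envelope way to see why a before/after comparison measures curvature rests on an idealization, which is an assumption of this sketch rather than the paper's exact setup. Treat a text as a point x in a continuous space and let f(x) = log p(x) under the LM. Then treat a perturbation as a small step h z in a random direction z. A second-order Taylor expansion gives:

```latex
\begin{align}
% Central difference: the first- and third-order Taylor terms cancel
z^\top \nabla^2 f(x)\, z &\approx \frac{f(x + hz) + f(x - hz) - 2f(x)}{h^2} \\
% If z and -z are equally likely, averaging over perturbations gives
f(x) - \mathbb{E}_z\big[f(x + hz)\big] &\approx -\frac{h^2}{2}\,\mathbb{E}_z\big[z^\top \nabla^2 f(x)\, z\big]
\end{align}
```

The second line uses the symmetry assumption: when z and -z are equally likely, E[f(x + hz)] equals the average of E[f(x + hz)] and E[f(x - hz)]. So if the original log prob exceeds the average perturbed log prob by a lot, the averaged directional second derivative is strongly negative.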

Add Hutchinson's trace estimator & we get an approximate trace of the Hessian.
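To make the trace step concrete, here is a sketch on a toy continuous function rather than on text. This is an assumption: real text is discrete, and in DetectGPT the T5 rewrites play the role of the random directions. Hutchinson's estimator uses the identity tr(H) = E[z^T H z], which holds for random probes z with E[z z^T] = I. Each z^T H z below is approximated by a central finite difference of f along z, which mirrors the before/after log-prob comparison.

```python
import numpy as np


def directional_second_derivative(f, x, z, h=1e-3):
    """Central finite difference approximating z^T H(x) z, where H is the Hessian of f at x."""
    return (f(x + h * z) + f(x - h * z) - 2.0 * f(x)) / h**2


def hutchinson_trace(f, x, num_samples=2000, h=1e-3, seed=0):
    """Estimate tr(H(x)) by averaging z^T H z over random Rademacher probes z."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        # Each entry is +1 or -1 with equal probability, so E[z z^T] = I.
        z = rng.choice([-1.0, 1.0], size=x.shape)
        total += directional_second_derivative(f, x, z, h)
    return total / num_samples


# Toy stand-in for a log prob: a concave quadratic whose Hessian is -A,
# so tr(H) = -tr(A) = -6.
A = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 0.5], [0.0, 0.5, 1.0]])


def f(x):
    return -0.5 * x @ A @ x


x0 = np.array([0.5, -1.0, 2.0])
print(hutchinson_trace(f, x0))  # close to -6.0
```

Rademacher (+1/-1) probes are a common choice for Hutchinson's estimator. Gaussian probes also satisfy E[z z^T] = I and would work here too.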

DetectGPT takes this approximate Hessian trace and simply thresholds it to get a detector.

Hessian trace very negative? Probably a model sample!
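Here is a minimal sketch of the thresholding step. The two inputs are hypothetical callables you would supply. `log_prob` returns the log probability of a text under the source LM. `perturb` returns a random rewrite, for example by masking spans and refilling them with T5. This illustrates the idea and is not the released DetectGPT code. The number of perturbations and the threshold are placeholder assumptions. The score is the log prob of the text minus the average log prob of its perturbed copies. It is large and positive when the text sits near a local maximum, i.e. where curvature is very negative.

```python
import statistics
from typing import Callable


def perturbation_discrepancy(
    text: str,
    log_prob: Callable[[str], float],  # hypothetical: log p(text) under the source LM
    perturb: Callable[[str], str],     # hypothetical: random rewrite, e.g. T5 mask-and-refill
    num_perturbations: int = 100,
) -> float:
    """log p(x) minus the mean log p over perturbed copies of x."""
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(num_perturbations)]
    return original - statistics.fmean(perturbed)


def is_model_sample(
    text: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str], str],
    threshold: float,
    num_perturbations: int = 100,
) -> bool:
    # Large positive discrepancy ~ very negative curvature ~ likely a model sample.
    return perturbation_discrepancy(text, log_prob, perturb, num_perturbations) > threshold


# Toy stand-ins, made up for illustration: "perturbing" just appends "*".
TOY_LOG_PROBS = {
    "model text": -10.0, "model text*": -14.0,
    "human text": -12.0, "human text*": -12.5,
}


def toy_log_prob(text: str) -> float:
    return TOY_LOG_PROBS[text]


def toy_perturb(text: str) -> str:
    return text + "*"


print(is_model_sample("model text", toy_log_prob, toy_perturb, threshold=1.0))  # True
print(is_model_sample("human text", toy_log_prob, toy_perturb, threshold=1.0))  # False
```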

Turns out this quantity discriminates between human-written and model-generated text very well, for various models and scales.

Does it work?

DetectGPT consistently improves AUROC (the probability that a randomly chosen model-generated text gets a higher detection score than a randomly chosen human-written one) over existing zero-shot methods, for models ranging from 100M to 175B parameters.
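For intuition, AUROC can be computed directly from its pairwise definition. In the small sketch below, the scores would be detector outputs (for example, perturbation-discrepancy scores), and the example values are made up. Ties count as half a correct pair, as in the usual Mann–Whitney formulation.

```python
from itertools import product
from typing import Sequence


def auroc(model_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Fraction of (model, human) pairs where the model text scores higher; ties count 1/2."""
    if not model_scores or not human_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for m, h in product(model_scores, human_scores):
        if m > h:
            wins += 1.0
        elif m == h:
            wins += 0.5
    return wins / (len(model_scores) * len(human_scores))


# Made-up detector scores: 6 of the 9 (model, human) pairs are ordered correctly.
print(auroc([4.0, 2.5, 0.8], [0.5, 1.0, 3.0]))  # 0.666...
```

The double loop is O(n*m). For larger evaluation sets, `sklearn.metrics.roc_auc_score` computes the same quantity more efficiently from labels and scores.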

It's also competitive with supervised classifiers, outperforming them in some domains.

There are other goodies in the experiments… for example, we explore robustness of detection to machine-generated text that has been partially revised.

Check out the paper for more (and the website for code/demo soon)

arxiv.org/abs/2301.11305
ericmitchell.ai/detectgpt

So much fun working on this with:

@yoonholeee
@SashaKhazatsky
@chrmanning
@chelseabfinn

Also extremely grateful for the support of Stanford's Center for Research on Foundation Models @StanfordCRFM in running experiments on some very large LMs!!
