Collin Burns · Dec 8 · 13 tweets
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?

We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.
Informally, instead of trying to explicitly and externally specify ground-truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by the model.
This may be possible because truth has special structure: unlike most features in a model, it is *logically consistent*.
We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.
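
As a rough illustration, here is a minimal PyTorch sketch of the CCS objective described in the paper: a linear probe on hidden states, trained so that a statement and its negation get complementary probabilities, with a confidence term to rule out the degenerate 0.5/0.5 solution. The hidden-state extraction, per-class normalization, and multiple random restarts used in the paper are omitted, and names like `CCSProbe` and the hyperparameters below are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe p(h) = sigmoid(w·h + b) over a model's hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h))

def ccs_loss(p_pos, p_neg):
    # Negation consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the trivial solution p(x+) = p(x-) = 0.5.
    confidence = torch.min(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(h_pos, h_neg, epochs=1000, lr=1e-3):
    """h_pos, h_neg: (N, dim) hidden states for the 'Yes' and 'No'
    completions of N contrast pairs (assumed already normalized)."""
    probe = CCSProbe(h_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe

def predict(probe, h_pos, h_neg):
    # Average the two views; with no labels the overall sign is ambiguous,
    # so "true" may correspond to either side of 0.5.
    with torch.no_grad():
        avg = 0.5 * (probe(h_pos) + (1 - probe(h_neg)))
    return (avg > 0.5).squeeze(-1)
```

The key point is that nothing in this objective references labels or model outputs: it only asks that the probe's answers be negation-consistent and confident across contrast pairs.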
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.
Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.
Nevertheless, we found it surprising that we could make substantial progress on this problem at all.

(Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback.

For many more details, please check out our paper (arxiv.org/abs/2212.03827) and code (tinyurl.com/latentknowledge)!
(And a huge thanks to my excellent collaborators -- Haotian Ye, Dan Klein, and @JacobSteinhardt -- for helping make this happen!)
