Humans won't be able to supervise models smarter than we are. For example, if a superhuman model generates a million lines of extremely complicated code, we won't be able to tell whether it's safe to run, whether it follows our instructions, and so on.
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?
We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can reproduce human-like errors; if we train them to generate highly rated text, they can output errors that human evaluators can't evaluate or simply don't notice.
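To make the "unlabeled activations" claim concrete, here is a minimal sketch of a contrast-consistent probe in the spirit of the paper, written in PyTorch. The activation tensors below are random placeholders; in practice they would be a language model's hidden states for the "X is true" / "X is false" versions of each statement. The loss combines the consistency and confidence terms the paper describes, but names like `probe` and `ccs_loss` are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_examples, hidden_dim = 256, 768  # placeholder sizes

# Placeholder activations for the positive ("... is true") and negative
# ("... is false") versions of each statement.
acts_pos = torch.randn(n_examples, hidden_dim)
acts_neg = torch.randn(n_examples, hidden_dim)

# Normalize each set independently so the probe can't just read off the prompt wording.
def normalize(x):
    return (x - x.mean(0)) / (x.std(0) + 1e-8)

acts_pos, acts_neg = normalize(acts_pos), normalize(acts_neg)

# A linear probe mapping activations to a probability that the statement is true.
probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def ccs_loss(p_pos, p_neg):
    # Consistency: P(true | "X is true") should equal 1 - P(true | "X is false").
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

for step in range(1000):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
```

Because no truth labels are used, the probe's output is only determined up to a global sign flip: it finds a direction in activation space that behaves consistently like "true vs. false," and the orientation has to be fixed afterward.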