Collin Burns
PhD Student @berkeley_ai working on making language models honest, interpretable, and aligned. Former Rubik's Cube world record holder.
Dec 8, 2022 · 13 tweets
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?

We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵

Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
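To make the "unlabeled activations" idea concrete, here is a minimal sketch of the paper's core method, Contrast-Consistent Search (CCS): train a small probe on the hidden states of paired "Yes"/"No" phrasings of each statement so that the two probabilities are consistent (sum to ~1) and confident (not both 0.5), with no truth labels anywhere. The names (`CCSProbe`, `train_ccs`) and hyperparameters are illustrative choices of mine, activation extraction is not shown, and the full method includes details like multiple random restarts that are omitted here.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability that the text is true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should get probabilities summing to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
              n_steps: int = 1000, lr: float = 1e-3) -> CCSProbe:
    """acts_pos / acts_neg: (N, hidden_dim) activations for the two phrasings."""
    # Normalize each set separately so the probe can't just read off the phrasing.
    acts_pos = (acts_pos - acts_pos.mean(0)) / (acts_pos.std(0) + 1e-8)
    acts_neg = (acts_neg - acts_neg.mean(0)) / (acts_neg.std(0) + 1e-8)

    probe = CCSProbe(acts_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe
```

Note that nothing in this objective references human labels: the only supervision signal is the logical constraint between a statement and its negation, which is what lets the probe find truth-like structure the model already represents internally.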