Collin Burns
Superalignment @OpenAI. Formerly @berkeley_ai @Columbia. Former Rubik's Cube world record holder.
Dec 14, 2023 12 tweets 3 min read
I’m extremely excited to finally share the first paper from the OpenAI Superalignment team :)

In it, we introduce a new research direction for aligning superhuman AI systems. 🧵

Humans won’t be able to supervise models smarter than us. For example, if a superhuman model generates a million lines of extremely complicated code, we won’t be able to tell whether it’s safe to run, whether it follows our instructions, and so on.
Dec 8, 2022 13 tweets 3 min read
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?

We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵

Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly rated text, they can output errors that human evaluators can’t assess or don’t notice.