Humans won't be able to supervise models smarter than we are. For example, if a superhuman model generates a million lines of extremely complicated code, we won't be able to tell whether it's safe to run, whether it follows our instructions, and so on.
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?
We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can reproduce human-like errors; if we train them to generate highly rated text, they can output errors that human evaluators can't evaluate or simply don't notice.
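To make the "unlabeled activations" claim concrete, here is a minimal sketch of a contrast-consistent probe in the spirit of the paper, written in PyTorch. The activation tensors below are random placeholders; in practice they would be a language model's hidden states for the "X is true" / "X is false" versions of each statement. The loss combines the consistency and confidence terms the paper describes, but names like `probe` and `ccs_loss` are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_examples, hidden_dim = 256, 768  # placeholder sizes

# Placeholder activations for the positive ("... is true") and negative
# ("... is false") versions of each statement.
acts_pos = torch.randn(n_examples, hidden_dim)
acts_neg = torch.randn(n_examples, hidden_dim)

# Normalize each set independently so the probe can't just read off the prompt wording.
def normalize(x):
    return (x - x.mean(0)) / (x.std(0) + 1e-8)

acts_pos, acts_neg = normalize(acts_pos), normalize(acts_neg)

# A linear probe mapping activations to a probability that the statement is true.
probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def ccs_loss(p_pos, p_neg):
    # Consistency: P(true | "X is true") should equal 1 - P(true | "X is false").
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

for step in range(1000):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
```

Because no truth labels are used, the probe's output is only determined up to a global sign flip: it finds a direction in activation space that behaves consistently like "true vs. false," and the orientation has to be fixed afterward.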