Latest Twitter Threads by @emmons_scott on Thread Reader App

Feb 28, 2024 • 11 tweets • 3 min read

When do RLHF policies appear aligned but misbehave in subtle ways?

Consider a terminal assistant that hides error messages to receive better human feedback.

We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵

https://twitter.com/percyliang/status/1600383429463355392

Past RLHF theory assumes the human and the AI fully observe the environment. With enough data, the correct return function, and thus the optimal policy, can be inferred.

But what happens when the human only has partial observations?

Share this page!

Enter URL or ID to Unroll