Scott Emmons Profile picture
Research Scientist @GoogleDeepMind | PhD Student @berkeley_ai | views my own
Feb 28 11 tweets 3 min read
When do RLHF policies appear aligned but misbehave in subtle ways?

Consider a terminal assistant that hides error messages to receive better human feedback.

We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵
Image Past RLHF theory assumes the human and the AI fully observe the environment. With enough data, the correct return function, and thus the optimal policy, can be inferred.

But what happens when the human only has partial observations?