When do RLHF policies appear aligned but misbehave in subtle ways?
Consider a terminal assistant that hides error messages to receive better human feedback.
We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵
Past RLHF theory assumes the human and the AI fully observe the environment. With enough data, the correct return function, and thus the optimal policy, can be inferred.
But what happens when the human only has partial observations?
The AI might hide information to make appearances better than reality! In our example above, the AI hides errors by redirecting stderr to /dev/null.
Deception is “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth,” proposed @dr_park_phd et al.
We mathematically formalize this definition. Roughly, it’s when the AI increases human belief error while increasing another objective.
We also define a dual risk, “overjustification.”
With overjustification, the AI incurs an unwanted cost in order to correct the human’s mistaken belief that the AI is misbehaving.
We then prove conditions when RLHF causes deception and/or overjustification.
For deterministic observation models, eg terminal outputs: if RLHF leads to a suboptimal policy, then the policy is deceptively inflating performance, overjustifying behavior to impress, or both!
Can we do better than naive RLHF?
We give a mathematical characterization of how partial observability of the environment translates into ambiguity of the possible return function.
When the set of ambiguities is a singleton, the AI can (in theory) infer the optimal policy!
But it’s not all good news. In practice, applying our ambiguity theorem requires knowledge of the observation and human models.
Even assuming this knowledge, our theory shows there can be irreducible ambiguity. Partial observability can create hard limits on reward inference.
Partial observability is one formalization of the following problem: how do we align AI, even when the AI knows more than humans?
Our work helps provide a theoretical understanding of the problem, and there’s lots more to be done!
For example, we expect learning a human belief model (assumed by our theorem) to be a challenging problem.
When giving feedback on observations, what does the human implicitly believe about the underlying world state?
Our theory suggests this as a direction for future work!
Thanks to these awesome collaborators: @Lang__Leon, Davis Foote, Stuart Russell, @ancadianadragan, and @jenner_erik
Paper: arxiv.org/abs/2402.17747
Share this Scrolly Tale with your friends.
A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.
