Scott Emmons Profile picture
Feb 28, 2024 11 tweets 3 min read Read on X
When do RLHF policies appear aligned but misbehave in subtle ways?

Consider a terminal assistant that hides error messages to receive better human feedback.

We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵
Image
Past RLHF theory assumes the human and the AI fully observe the environment. With enough data, the correct return function, and thus the optimal policy, can be inferred.

But what happens when the human only has partial observations?
The AI might hide information to make appearances better than reality! In our example above, the AI hides errors by redirecting stderr to /dev/null.
Deception is “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth,” proposed @dr_park_phd et al.

We mathematically formalize this definition. Roughly, it’s when the AI increases human belief error while increasing another objective. Image
We also define a dual risk, “overjustification.”

With overjustification, the AI incurs an unwanted cost in order to correct the human’s mistaken belief that the AI is misbehaving.
We then prove conditions when RLHF causes deception and/or overjustification.

For deterministic observation models, eg terminal outputs: if RLHF leads to a suboptimal policy, then the policy is deceptively inflating performance, overjustifying behavior to impress, or both!
Can we do better than naive RLHF?

We give a mathematical characterization of how partial observability of the environment translates into ambiguity of the possible return function.

When the set of ambiguities is a singleton, the AI can (in theory) infer the optimal policy! Image
But it’s not all good news. In practice, applying our ambiguity theorem requires knowledge of the observation and human models.

Even assuming this knowledge, our theory shows there can be irreducible ambiguity. Partial observability can create hard limits on reward inference.
Partial observability is one formalization of the following problem: how do we align AI, even when the AI knows more than humans?

Our work helps provide a theoretical understanding of the problem, and there’s lots more to be done!
For example, we expect learning a human belief model (assumed by our theorem) to be a challenging problem.

When giving feedback on observations, what does the human implicitly believe about the underlying world state?

Our theory suggests this as a direction for future work!
Thanks to these awesome collaborators: @Lang__Leon, Davis Foote, Stuart Russell, @ancadianadragan, and @jenner_erik

Paper: arxiv.org/abs/2402.17747

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Scott Emmons

Scott Emmons Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(