Thread Reader
Share this page!
×
Post
Share
Email
Enter URL or ID to Unroll
×
Unroll Thread
You can paste full URL like: https://x.com/threadreaderapp/status/1644127596119195649
or just the ID like: 1644127596119195649
How to get URL link on X (Twitter) App
On the Twitter thread, click on
or
icon on the bottom
Click again on
or
Share Via icon
Click on
Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at
Twitter Help
Scott Emmons
@emmons_scott
Research Scientist @GoogleDeepMind | PhD Student @berkeley_ai | views my own
Subscribe
Save as PDF
Feb 28
•
11 tweets
•
3 min read
When do RLHF policies appear aligned but misbehave in subtle ways?
Consider a terminal assistant that hides error messages to receive better human feedback.
We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵
https://twitter.com/percyliang/status/1600383429463355392
Past RLHF theory assumes the human and the AI fully observe the environment. With enough data, the correct return function, and thus the optimal policy, can be inferred.
But what happens when the human only has partial observations?