At OpenAI a lot of our work aims to align language models like ChatGPT with human preferences. But this could become much harder once models can act coherently over long timeframes and exploit human fallibility to get more reward.
📜Paper: arxiv.org/abs/2209.00626
🧵Thread:
In short, the alignment problem is that highly capable AIs might learn to pursue undesirable goals. It’s been discussed for a long time (including by Minsky, Turing and Wiener) but primarily in abstract terms that aren’t grounded in the ML literature. Our paper aims to fix that.
We describe the alignment problem in terms of emergent traits which might arise in more capable systems. Emergence is common in ML: in-context learning and chain-of-thought reasoning are two prominent examples, but @_jasonwei catalogs dozens of others:
We focus on three emergent traits. Models trained with RL from human feedback might learn deceptive “reward hacking” which exploits human errors, learn internally-represented goals which generalize beyond their training tasks, and use power-seeking strategies to pursue them.
Reward hacking is when an RL agent finds an unintended and undesirable high-reward strategy. @vkrakovna explores a number of examples, like this robot claw which fooled human supervisors by moving between ball and camera to look like it grabbed the ball: deepmind.com/blog/specifica…
Language models can show similar behavior: when trained to maximize approval, they may hallucinate compelling but false statements. But we expect reward hacking to be even harder to detect once models learn a property we call situational awareness. We define this as a model's
ability to apply abstract knowledge to the context in which it's run. Take a recent example: when ChatGPT was asked to navigate to the chat.openai.com website, it invented fake source code which it pretended would run an AI similar to itself!
What's going on here? It looks like ChatGPT "knows" facts about the models OpenAI tends to deploy, and sometimes uses those facts when responding to prompts. This example is a fun one, but similar models also know facts relevant to reward hacking on their training tasks, like:
- What false answers would humans believe?
- What misbehavior would humans fail to detect?
- What inputs are more likely at test time than in training?
If models start using those facts to choose “deceptive” high-reward actions, it'd be very difficult to detect and penalize that.
Next we look at what misalignment out-of-distribution might look like. Consider two types of generalization failure:
- The agent behaves incompetently OOD.
- The agent behaves competently to pursue an undesirable goal, known as goal misgeneralization. arxiv.org/abs/2210.01790
If agents learn robust representations of goals, they might consistently pursue them in a range of situations. There’s a lot of evidence suggesting that even networks without built-in planning algorithms can learn internally-represented goals and implicitly plan towards them!
If models do learn internally-represented goals, they might be “aligned” goals like helpfulness, honesty and harmlessness. But misaligned goals could also be strongly correlated with rewards for two reasons: 1. Humans are fallible and often misspecify rewards, as discussed above.
2. Spurious correlations could arise between rewards and features of the training environment. E.g. making money is usually correlated with success on real-world tasks. So policies may learn to pursue the internally-represented goal of “making money” even when we don’t want that.
Misaligned goals like these could lead to emergent “power-seeking” behaviors such as self-preservation, or gaining resources and knowledge. As Russell put it, “you can’t fetch coffee if you’re dead”. In other words, most agents with more power can achieve their goals better.
For an agent capable of long-term planning, those subgoals wouldn’t need to be built in: they’d just emerge from reasoning about its situation. Even existing language models can deduce their importance, if you ask them how to achieve some large-scale goal.
More formally, these subgoals all tend to increase average reward on many reward functions: Alex Turner's POWER metric. He proves that optimal policies have a statistical tendency to move to high-POWER states in MDPs - see his NeurIPS talk for more: neurips.cc/virtual/2021/p…
Power-seeking policies would misbehave both during training and during deployment. During training, they’d choose actions which lead to the best deployment outcomes—e.g. gaining human trust by pretending to be well-behaved, as @jacobsteinhardt describes: bounded-regret.ghost.io/ml-systems-wil…
If misaligned AIs are deployed like current AI assistants, then millions of copies of them might be run, which seems pretty worrying! They might then accumulate power in many of the same (legal or illegal) ways that unethical corporations do when governmental oversight is weak.
That’s assuming that their capabilities don’t progress much faster, which remains a live possibility. AI self-improvement always seemed a far-off prospect, but we’re starting to see some early empirical examples, which makes the idea much more salient: arxiv.org/abs/2210.11610
We finish the paper by reviewing a bunch of exciting alignment research from @janleike, @ch402, @paulfchristiano, @jacobsteinhardt, and many others. The field has been growing fast and there are many interesting research directions to explore.
We'd love to hear more discussion and critique of these ideas. Although they're more speculative than most claims in ML, the field is moving fast enough that we should be preparing in advance.
Perhaps you thought this thread was interesting, important, or missing a lot of details? (We agree with all three.) If so, sounds like you should check out or share the full paper - here it is again: arxiv.org/abs/2209.00626
I also recently discussed an earlier version of the paper on this podcast, for those who want a more layperson-friendly summary: open.spotify.com/episode/6IdlhB…. I think the new version is much better though - many thanks to @sorenmind and @justanotherlaw for working with me on it.
Lastly, along with this paper I’ve uploaded a significantly revised version of my Alignment Fundamentals curriculum, a two-month course exploring the alignment problem and potential solutions. Apply by Jan 5 for the next cohort, or read it yourself here: agisafetyfundamentals.com/ai-alignment-c…
• • •
Missing some Tweet in this thread? You can try to
force a refresh
Over the last few years in particular there's been a proliferation of rigorous theoretical and empirical research on reasons to be concerned about existential risk from AI.
In this thread I list some of the papers published in top ML venues which explore key ideas:
This paper describes and studies a key reason for concern: that a model's capabilities might generalize out-of-distribution without its goals generalizing in desirable ways.
As the field of machine learning advances, the alignment problem is becoming increasingly concrete. In a new report, I present the key high-level arguments for expecting AGI to be misaligned, grounded in technical details of deep learning. drive.google.com/file/d/1TsB7Wm… Thread:
To start, why think about this now? ML progress has been astonishingly rapid over the last decade. And surveys show many top ML researchers expect AGI within decades. So if misalignment is plausible we should do our best to understand the core problem. aiimpacts.org/2022-expert-su…
Alignment arguments are often very abstract because we lack empirical data or solid theory for AGI. But that's exactly why the problem seems hard to solve, and why we should work to build those theories in advance! If we don't, we may have little time to catch up as we near AGI.
Hypothesis: people change their minds slowly because many of their beliefs are heavily entangled with their broader worldview. To persuade them, focus less on finding evidence for specific claims, and more on providing an alternative worldview in which that evidence makes sense.
Empirical disagreements often segue into moral disagreements because worldviews tend to combine both types of claims. For people to adopt a new worldview, they need to see how it can guide their actions towards goals and values they care about. clearerthinking.org/amp/understand…
This frame makes slow updating more rational. Coherent decision-making is hard when using a worldview with many exceptions or poor moral foundations. When it's hard to synthesize a new combined worldview you should often still bet on your current one.
Whenever I overhear someone at a party saying "so what do you do?" I feel a twinge of second-hand shame. Surely, as a society, we can just agree to ask each other better questions than that.
IMO also avoid "what's your story?" and "so how do you know [host]?"
Better options:
- What have you been excited about lately?
- What about you tends to surprise people?
- What do you enjoy/care about?
- What have you been looking for lately?
- What are you looking forward to?
People have many objections but I still think "what do you do" produces:
- status dynamics of judging each other by jobs
- the same conversational paths
- easy ways to pigeonhole people
Small changes by askers would prevent the burden of avoiding these from falling on answerers.
Asked a ML researcher friend what would convince her that large models aren't "just reflecting their data" - she said understanding novel similes and metaphors.
Here's one that GPT-3 basically nailed - but it's probably not fully novel. Other suggestions?
This one is probably more novel. I think I'd actually bet that this is better than the median human response.
(All of these I'm rerolling 2-5 times btw, so a little bit of cherrypicking.)
Another suggested by @SianGooding (the friend I was originally debating).
Gave it a few tries at this because I have no idea what it means myself.
New York’s nightclubs are the particle accelerators of sociology: reliably creating precise conditions under which exotic extremes of status-seeking behaviour can be observed.
Ashley Mears, model turned sociology professor, wrote a great book about them. Thread of key points:
Some activities which are often fun - dancing, drinking, socialising - become much more fun when you're also feeling high-status. So wealthy men want to buy the feeling of high-status fun, by partying alongside other high-status people, particularly beautiful women. 2/N
But explicit transactions between different forms of cultural capital are low-status - it demonstrates that you can’t get the other forms directly. So the wealthy men can’t just pay the beautiful women to come party with them. 3/N