Richard Ngo
Oct 11 · 9 tweets · 4 min read
Over the last few years in particular there's been a proliferation of rigorous theoretical and empirical research on reasons to be concerned about existential risk from AI.

In this thread I list some of the papers published in top ML venues which explore key ideas:
Goal misgeneralization in deep RL (Langosco et al., 2022): arxiv.org/abs/2105.14111

This paper describes and studies a key reason for concern: that a model's capabilities might generalize out-of-distribution without its goals generalizing in desirable ways.
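To make the failure mode concrete, here's a minimal toy sketch (my own construction, not the paper's own experiments): a tabular Q-learning agent is trained in a gridworld where the rewarded cell always happens to be the top-right corner, then evaluated with the reward moved. The navigation capability transfers; the learned goal doesn't.

```python
import numpy as np

SIZE = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action, goal):
    """Deterministic gridworld transition; reward 1 only for stepping onto the goal."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    return nxt, float(nxt == goal), nxt == goal

def train(goal, episodes=2000, alpha=0.5, gamma=0.95, eps=0.1):
    """Tabular Q-learning with random start states but a *fixed* goal location."""
    Q = np.zeros((SIZE, SIZE, len(ACTIONS)))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        state = (int(rng.integers(SIZE)), int(rng.integers(SIZE)))
        for _ in range(50):
            a = int(rng.integers(len(ACTIONS))) if rng.random() < eps else int(np.argmax(Q[state]))
            nxt, reward, done = step(state, a, goal)
            Q[state][a] += alpha * (reward + gamma * np.max(Q[nxt]) - Q[state][a])
            state = nxt
            if done:
                break
    return Q

def reaches(Q, start, goal, max_steps=50):
    """Greedy rollout: does the learned policy reach `goal` from `start`?"""
    state = start
    for _ in range(max_steps):
        state, _, done = step(state, int(np.argmax(Q[state])), goal)
        if done:
            return True
    return False

start = (2, 2)               # evaluate from the centre of the grid
train_goal = (0, SIZE - 1)   # during training the reward always sits in the top-right corner
test_goal = (SIZE - 1, 0)    # at test time the reward is moved to the bottom-left
Q = train(train_goal)
print(reaches(Q, start, train_goal))  # typically True: the policy navigates competently
print(reaches(Q, start, test_goal))   # typically False: it still heads for the top-right corner
```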
Optimal policies tend to seek power (Turner et al., 2021): proceedings.neurips.cc/paper/2021/has…

This paper formalizes the concept of power-seeking in an RL context, and demonstrates that power-seeking behavior is optimal in many environments.
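A tiny numerical sketch of the underlying intuition (my own toy MDP, not the paper's formal definition of POWER): from a start state, one action leads to a state from which five terminal outcomes are reachable, the other to a state with only one. Sampling reward functions at random, the optimal policy almost always routes through the state with more options.

```python
import numpy as np

# Two-step deterministic MDP: action "hub" reaches a state with N_HUB terminal
# outcomes, action "dead_end" reaches a state with N_DEAD. Sample reward
# functions uniformly over terminal outcomes and check which action an optimal
# policy would take at the start state.
rng = np.random.default_rng(0)
N_HUB, N_DEAD, TRIALS = 5, 1, 100_000

prefers_hub = 0
for _ in range(TRIALS):
    rewards = rng.random(N_HUB + N_DEAD)   # one sampled reward per terminal outcome
    hub_value = rewards[:N_HUB].max()      # best outcome reachable via the hub
    dead_value = rewards[N_HUB:].max()     # best outcome reachable via the dead end
    prefers_hub += hub_value > dead_value

print(prefers_hub / TRIALS)  # ~0.833: for 5/6 of reward functions, keeping options open is optimal
```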
The effects of reward misspecification (Pan, Bhatia and Steinhardt, 2022): arxiv.org/abs/2201.03544

This paper empirically explores how increasingly capable agents tend to exploit reward misspecifications more severely.
Consequences of misaligned AI (Zhuang and Hadfield-Menell, 2021): arxiv.org/abs/2102.03896

This paper provides a model in which strong optimization for an incomplete proxy objective leads to arbitrarily bad outcomes.
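A toy numerical sketch of that dynamic (my own illustration with made-up utility functions, not the paper's formal model): true utility depends on three attributes with diminishing returns, the proxy measures only two of them, and all three compete for a fixed shared budget. Optimizing the proxy harder keeps raising proxy utility while draining the unmeasured attribute, so true utility eventually falls.

```python
import numpy as np

def true_utility(x):    # concave in every attribute (diminishing returns)
    return np.sum(np.log(1.0 + x))

def proxy_utility(x):   # omits the last attribute
    return np.sum(np.log(1.0 + x[:-1]))

BUDGET = 10.0

def optimize_proxy(steps, lr=0.05):
    """Gradient ascent on the proxy, renormalizing to a fixed shared resource budget."""
    x = np.full(3, BUDGET / 3)              # start from a balanced allocation
    for _ in range(steps):
        grad = np.zeros(3)
        grad[:-1] = 1.0 / (1.0 + x[:-1])    # gradient of the proxy (zero for the omitted attribute)
        x = np.maximum(x + lr * grad, 0.0)
        x *= BUDGET / x.sum()               # resources are shared, so the omitted attribute gets squeezed
    return x

for steps in [0, 50, 500, 5000]:
    x = optimize_proxy(steps)
    print(f"steps={steps:5d}  proxy={proxy_utility(x):.3f}  true={true_utility(x):.3f}  omitted={x[-1]:.3f}")
# Proxy utility rises monotonically; true utility ends up lower than at the balanced start.
```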
Advanced artificial agents intervene in the provision of reward (Cohen, Hutter and Osborne, 2022): onlinelibrary.wiley.com/doi/10.1002/aa…

This paper argues that advanced RL agents will learn to care about feedback mechanisms used during training, since this will predictably get higher reward.
It *is* true that there's no peer-reviewed publication which unites the ideas in these papers into a single cohesive argument for the plausibility of AI takeover. IMO the most empirically-grounded presentation of the overall argument is my recent paper:
The papers above are my main recommendations, but here are a few more that are also interesting:
- Predictability and surprise in large generative models: arxiv.org/abs/2202.07785
- The off-switch game: arxiv.org/abs/1611.08219…
- RL with a corrupted reward channel: arxiv.org/abs/1705.08417
All these papers and more are covered in my alignment fundamentals curriculum:

More from @RichardMCNgo

Aug 17
As the field of machine learning advances, the alignment problem is becoming increasingly concrete. In a new report, I present the key high-level arguments for expecting AGI to be misaligned, grounded in technical details of deep learning. drive.google.com/file/d/1TsB7Wm… Thread:
To start, why think about this now? ML progress has been astonishingly rapid over the last decade. And surveys show many top ML researchers expect AGI within decades. So if misalignment is plausible we should do our best to understand the core problem. aiimpacts.org/2022-expert-su…
Alignment arguments are often very abstract because we lack empirical data or solid theory for AGI. But that's exactly why the problem seems hard to solve, and why we should work to build those theories in advance! If we don't, we may have little time to catch up as we near AGI.
Aug 2
Hypothesis: people change their minds slowly because many of their beliefs are heavily entangled with their broader worldview. To persuade them, focus less on finding evidence for specific claims, and more on providing an alternative worldview in which that evidence makes sense.
Empirical disagreements often segue into moral disagreements because worldviews tend to combine both types of claims. For people to adopt a new worldview, they need to see how it can guide their actions towards goals and values they care about.
clearerthinking.org/amp/understand…
This frame makes slow updating more rational. Coherent decision-making is hard when using a worldview with many exceptions or poor moral foundations. When it's hard to synthesize a new combined worldview you should often still bet on your current one.
Jul 17
Whenever I overhear someone at a party saying "so what do you do?" I feel a twinge of second-hand shame. Surely, as a society, we can just agree to ask each other better questions than that.
IMO also avoid "what's your story?" and "so how do you know [host]?"

Better options:
- What have you been excited about lately?
- What about you tends to surprise people?
- What do you enjoy/care about?
- What have you been looking for lately?
- What are you looking forward to?
People have many objections but I still think "what do you do" produces:
- status dynamics of judging each other by jobs
- the same conversational paths
- easy ways to pigeonhole people

Small changes by askers would prevent the burden of avoiding these from falling on answerers.
Apr 10
Asked an ML researcher friend what would convince her that large models aren't "just reflecting their data" - she said understanding novel similes and metaphors.

Here's one that GPT-3 basically nailed - but it's probably not fully novel. Other suggestions?
This one is probably more novel. I think I'd actually bet that this is better than the median human response.

(All of these I'm rerolling 2-5 times btw, so a little bit of cherrypicking.)
Another suggested by @SianGooding (the friend I was originally debating).

Gave it a few tries at this because I have no idea what it means myself.
Apr 2
New York’s nightclubs are the particle accelerators of sociology: reliably creating precise conditions under which exotic extremes of status-seeking behaviour can be observed.

Ashley Mears, model turned sociology professor, wrote a great book about them. Thread of key points:
Some activities which are often fun - dancing, drinking, socialising - become much more fun when you're also feeling high-status. So wealthy men want to buy the feeling of high-status fun, by partying alongside other high-status people, particularly beautiful women. 2/N
But explicit transactions between different forms of cultural capital are low-status - it demonstrates that you can’t get the other forms directly. So the wealthy men can’t just pay the beautiful women to come party with them. 3/N
Feb 12
Thread of wacky predictions about the future. I'm trying more to raise interesting possibilities than to home in on the most likely ones - but don't write them off too easily either!
Our descendants will live in virtual environments with many more than 3 dimensions. They'll look back on us with a similar mix of pity and bewilderment as we have when reading about Flatland.
We'll eventually be able to predict human social dynamics in a scientific way, using high-level principles (like in economics).

We'll never do the same for society as a whole though because of reflexivity: the (superhuman) minds doing the understanding themselves affect society!
