Over the last few years in particular, there's been a proliferation of rigorous theoretical and empirical research on reasons to be concerned about existential risk from AI.
In this thread I list some of the papers published in top ML venues which explore key ideas:
This paper describes and studies a key reason for concern: that a model's capabilities might generalize out of distribution without its goals generalizing in desirable ways.
This paper argues that advanced RL agents will learn to care about the feedback mechanisms used during training, since doing so will predictably get higher reward.
It *is* true that there's no peer-reviewed publication which unites the ideas in these papers into a single cohesive argument for the plausibility of AI takeover. IMO the most empirically grounded presentation of the overall argument is my recent paper:
As the field of machine learning advances, the alignment problem is becoming increasingly concrete. In a new report, I present the key high-level arguments for expecting AGI to be misaligned, grounded in technical details of deep learning. drive.google.com/file/d/1TsB7Wm… Thread:
To start, why think about this now? ML progress has been astonishingly rapid over the last decade. And surveys show many top ML researchers expect AGI within decades. So if misalignment is plausible, we should do our best to understand the core problem. aiimpacts.org/2022-expert-su…
Alignment arguments are often very abstract because we lack empirical data or solid theory for AGI. But that's exactly why the problem seems hard to solve, and why we should work to build those theories in advance! If we don't, we may have little time to catch up as we near AGI.
Hypothesis: people change their minds slowly because many of their beliefs are heavily entangled with their broader worldview. To persuade them, focus less on finding evidence for specific claims, and more on providing an alternative worldview in which that evidence makes sense.
Empirical disagreements often segue into moral disagreements because worldviews tend to combine both types of claims. For people to adopt a new worldview, they need to see how it can guide their actions towards goals and values they care about. clearerthinking.org/amp/understand…
This frame makes slow updating more rational. Coherent decision-making is hard when using a worldview with many exceptions or poor moral foundations. When it's hard to synthesize a new combined worldview you should often still bet on your current one.
Whenever I overhear someone at a party saying "so what do you do?" I feel a twinge of second-hand shame. Surely, as a society, we can just agree to ask each other better questions than that.
IMO also avoid "what's your story?" and "so how do you know [host]?"
Better options:
- What have you been excited about lately?
- What about you tends to surprise people?
- What do you enjoy/care about?
- What have you been looking for lately?
- What are you looking forward to?
People have many objections, but I still think "what do you do" produces:
- status dynamics of judging each other by jobs
- the same conversational paths
- easy ways to pigeonhole people
Small changes by askers would prevent the burden of avoiding these from falling on answerers.
Asked an ML researcher friend what would convince her that large models aren't "just reflecting their data" - she said understanding novel similes and metaphors.
Here's one that GPT-3 basically nailed - but it's probably not fully novel. Other suggestions?
This one is probably more novel. I think I'd actually bet that this is better than the median human response.
(All of these I'm rerolling 2-5 times btw, so a little bit of cherrypicking.)
Another suggested by @SianGooding (the friend I was originally debating).
Gave it a few tries at this because I have no idea what it means myself.
New York’s nightclubs are the particle accelerators of sociology: reliably creating precise conditions under which exotic extremes of status-seeking behaviour can be observed.
Ashley Mears, model turned sociology professor, wrote a great book about them. Thread of key points:
Some activities which are often fun - dancing, drinking, socialising - become much more fun when you're also feeling high-status. So wealthy men want to buy the feeling of high-status fun, by partying alongside other high-status people, particularly beautiful women. 2/N
But explicit transactions between different forms of cultural capital are low-status - they demonstrate that you can't get the other forms directly. So the wealthy men can't just pay the beautiful women to come party with them. 3/N
Thread of wacky predictions about the future. I'm trying more to raise interesting possibilities than to home in on the most likely ones - but don't write them off too easily either!
Our descendants will live in virtual environments with many more than 3 dimensions. They'll look back on us with the same mix of pity and bewilderment we feel when reading about Flatland.
We'll eventually be able to predict human social dynamics in a scientific way, using high-level principles (like in economics).
We'll never do the same for society as a whole, though, because of reflexivity: the (superhuman) minds doing the understanding themselves affect society!