Over the last few years in particular, there's been a proliferation of rigorous theoretical and empirical research on reasons to be concerned about existential risk from AI.
In this thread I list some of the papers published in top ML venues which explore key ideas:
This paper describes and studies a key reason for concern: that a model's capabilities might generalize out of distribution without its goals generalizing in desirable ways.
This paper argues that advanced RL agents will learn to care about the feedback mechanisms used during training, since doing so will predictably get higher reward.
It *is* true that there's no peer-reviewed publication which unites the ideas in these papers into a single cohesive argument for the plausibility of AI takeover. IMO the most empirically grounded presentation of the overall argument is my recent paper:
As the field of machine learning advances, the alignment problem is becoming increasingly concrete. In a new report, I present the key high-level arguments for expecting AGI to be misaligned, grounded in technical details of deep learning. drive.google.com/file/d/1TsB7Wm… Thread:
To start, why think about this now? ML progress has been astonishingly rapid over the last decade. And surveys show many top ML researchers expect AGI within decades. So if misalignment is plausible, we should do our best to understand the core problem. aiimpacts.org/2022-expert-su…
Alignment arguments are often very abstract because we lack empirical data or solid theory for AGI. But that's exactly why the problem seems hard to solve, and why we should work to build those theories in advance! If we don't, we may have little time to catch up as we near AGI.
Hypothesis: people change their minds slowly because many of their beliefs are heavily entangled with their broader worldview. To persuade them, focus less on finding evidence for specific claims, and more on providing an alternative worldview in which that evidence makes sense.
Empirical disagreements often segue into moral disagreements because worldviews tend to combine both types of claims. For people to adopt a new worldview, they need to see how it can guide their actions towards goals and values they care about. clearerthinking.org/amp/understand…
This frame makes slow updating more rational. Coherent decision-making is hard when using a worldview with many exceptions or poor moral foundations. When it's hard to synthesize a new combined worldview you should often still bet on your current one.
Whenever I overhear someone at a party saying "so what do you do?" I feel a twinge of second-hand shame. Surely, as a society, we can just agree to ask each other better questions than that.
IMO also avoid "what's your story?" and "so how do you know [host]?"
Better options:
- What have you been excited about lately?
- What about you tends to surprise people?
- What do you enjoy/care about?
- What have you been looking for lately?
- What are you looking forward to?
People have many objections, but I still think "what do you do" produces:
- status dynamics of judging each other by jobs
- the same conversational paths
- easy ways to pigeonhole people
Small changes by askers would prevent the burden of avoiding these from falling on answerers.
Asked an ML researcher friend what would convince her that large models aren't "just reflecting their data" - she said understanding novel similes and metaphors.
Here's one that GPT-3 basically nailed - but it's probably not fully novel. Other suggestions?
This one is probably more novel. I think I'd actually bet that this is better than the median human response.
(All of these I'm rerolling 2-5 times btw, so a little bit of cherrypicking.)
Another suggested by @SianGooding (the friend I was originally debating).
Gave it a few tries at this because I have no idea what it means myself.
New York’s nightclubs are the particle accelerators of sociology: reliably creating precise conditions under which exotic extremes of status-seeking behaviour can be observed.
Ashley Mears, model turned sociology professor, wrote a great book about them. Thread of key points:
Some activities which are often fun - dancing, drinking, socialising - become much more fun when you're also feeling high-status. So wealthy men want to buy the feeling of high-status fun, by partying alongside other high-status people, particularly beautiful women. 2/N
But explicit transactions between different forms of cultural capital are low-status - they demonstrate that you can't get the other forms directly. So the wealthy men can't just pay the beautiful women to come party with them. 3/N
Thread of wacky predictions about the future. I'm trying more to raise interesting possibilities than to home in on the most likely ones - but don't write them off too easily either!
Our descendants will live in virtual environments with many more than 3 dimensions. They'll look back on us with the same mix of pity and bewilderment we feel when reading about Flatland.
We'll eventually be able to predict human social dynamics in a scientific way, using high-level principles (like in economics).
We'll never do the same for society as a whole, though, because of reflexivity: the (superhuman) minds doing the understanding themselves affect society!