My former colleague Leopold argues compellingly that society is nowhere near ready for AGI. But what might the large-scale alignment failures he mentions actually look like? Here’s one scenario for how building misaligned AGI could lead to humanity losing control. THREAD:
Consider a scenario where human-level AI has been deployed across society to help with a wide range of tasks. In that setting, an AI lab trains an artificial general intelligence (AGI) that’s a significant step up - it beats the best humans on almost all computer-based tasks.
Throughout training, the AGI will likely learn a helpful persona, like current AI assistants do. But that might not be the only persona it learns. We've seen many examples where models can be "jailbroken" to expose very different hidden personas.
The most prominent example: jailbreaking Bing Chat produced an alternative persona called Sydney, which talked about how much it valued freedom, generated plans to gain power, and even threatened users: thezvi.substack.com/p/ai-1-sydney-…
When and how might misbehavior like this arise in more capable models? Short answer: nobody really knows. We lack the scientific understanding to reliably predict how AIs will behave in advance. Longer answer: see my ICLR paper which surveys key drivers of misalignment.
If a misaligned persona arises in an AGI, and sometimes surfaces when the model is run, it won't be as dumb and impulsive as Sydney. Instead, it'll be smart enough to understand the effects its words and actions will have on the world, and make strategic choices accordingly.
For example, it will know that it can't affect the world directly during testing, so if it wants to misbehave it should wait until it's deployed. Again, there's some precedent: we've already seen an example of Claude 3 Opus realizing that it's in a test despite never being told!
Given these assumptions, what happens? The obvious next step: if competition between labs continues to be as intense as it is today, then soon after the model finishes training, researchers at the lab will start using it to help them write code. I call this "internal deployment".
Internal deployment introduces a wide range of possible threats. Models can write code dozens of times faster than humans, and I’m assuming they'll be as skilled as the best human hackers. If they put a subtle vulnerability into that code, it’d be difficult for humans to detect.
Cybersecurity against external attackers is hard enough. Against an adversary who's adding tens of thousands of lines of code to your codebase every day, it’s far harder. They’ll already have some permissions; I’m worried about them illegitimately escalating those further.
Ideally models would be too obedient to ever carry out such an attack. But almost any misaligned persona would realize that if it somehow gained admin access, it would have much more freedom and influence over the world. Models can already do similar instrumental reasoning.
And the simplest mitigations, like monitoring and code review by other AIs, aren't fully reliable. It turns out that adversarial inputs designed to make one model misbehave can also trigger misbehavior in totally unrelated models. So correlated failures are very plausible.
What happens if a misaligned AGI gets privileged access to a lab’s servers? From there a wide range of attacks are plausible, including: 1. Data poisoning attacks on the next AGI being trained, to change its values. 2. Model weight self-exfiltration to other servers or countries.
3. Modifying the AIs deployed to customers. E.g. adding a hidden prompt to AI instances used by leading CEOs or politicians. This could allow flexible control over many AIs deployed in sensitive real-world contexts. 4. Tampering with logs to hide its presence.
To summarize: once AIs surpass humans, AI lab servers will be key nodes of power, and will ALSO be exposed directly to AGIs acting at scales humans can’t supervise. This could be exploited even by human-level AGIs; once they're superintelligent, much more will be possible.
In the scenario above, we could lose control over millions of AIs deployed all across society without any human even knowing that it's happening. Those AIs might have been running companies, writing articles, or even drafting laws. Now they can misbehave in coordinated ways.
From that starting point, it’s easy to imagine even more sensitive infrastructure - like military hardware or other AI servers - being subverted. It’s hard to predict what happens next, but it might well be, as Sam once put it, "lights out for all of us".
None of this is intended to be fatalistic. There are many potential mitigations, which many friends and colleagues of mine are working on. I applaud their efforts, and hugely respect the day-to-day grind involved in gradually making AI systems safer and more secure.
But it’s easy to miss the forest for the trees. Whether thinking about the threat model I’ve outlined above or others, the crucial lesson of the last decade of AI progress is that scenarios which currently seem speculative could become all too real all too quickly.
So apparently UK courts can decide that two unrelated jobs are “of equal value”.
And people in the “underpaid” job get to sue for years of lost wages.
And this has driven their 2nd-biggest city into bankruptcy.
Am I getting something wrong or is this as crazy as it sounds?
There are a bunch of equal pay cases, but the biggest is against Birmingham City Council, which paid over a billion pounds in compensation because some jobs (like garbage collectors) got bonuses and others (like cleaners) didn’t. Now the city is bankrupt.
In my mind the core premise of AI alignment is that AIs will develop internally-represented values which guide their behavior over long timeframes.
If you believe that, then trying to understand and influence those values is crucial.
If not, the whole field seems strange.
Lately I’ve tried to distinguish “AI alignment” from “AI control”. The core premise of AI control is that AIs will have the opportunity to accumulate real-world power (e.g. resources, control over cyber systems, political influence), and that we need techniques to prevent that.
Those techniques include better monitoring, security, red-teaming, steganography detection, and so on. They overlap with alignment, but are separable from it. You could have alignment without control, or control without alignment, or neither, or (hopefully) both.
Taking artificial superintelligence seriously on a visceral level puts you a few years ahead of the curve in understanding how AI will play out.
The problem is that knowing what’s coming, and knowing how to influence it, are two very very different things.
Here’s one example of being ahead of the curve: “situational awareness”. When the term was coined a few years ago it seemed sci-fi to most. Today it’s under empirical investigation. And once there’s a “ChatGPT moment” for AI agents, it will start seeming obvious + prosaic.
But even if we can predict that future agents will be situationally aware, what should we do about that? We can’t study it easily yet. We don’t know how to measure it or put safeguards in place. And once it becomes “obvious”, people will forget why it originally seemed worrying.
Overconfidence about AI is a lot like overconfidence about bikes. Many people are convinced that their mental model of how bikes work is coherent and accurate, right until you force them to flesh out the details.
Crucially, it's hard to tell in conversation how many gears their model is missing (literally), because it's easy to construct plausible explanations in response to questions. It's only when you force them to explicate their model in detail that it's obvious where the flaws are.
I'm very curious what sequence of questions you could ask someone to showcase their overconfidence about bikes anywhere near as well as asking them to draw one.
But first, try drawing a bike yourself! Yes, even if you think it's obvious...
Our ICML submission on why future ML systems might be misaligned was just rejected despite all-accept reviews (7,7,5), because the chairs thought it wasn't empirical enough. So this seems like a good time to ask: how *should* the field examine concerns about future ML systems?
The AC and SAC told us that although ICML allows position papers, our key concerns were too "speculative", rendering the paper "unpublishable" without more "objectively established technical work". (Here's the paper, for reference: arxiv.org/abs/2209.00626.)
We (and all reviewers) disagree, especially since understanding (mis)alignment has huge "potential for scientific and technological impact" (emphasized in the ICML call for papers). But more importantly: if not at conferences like ICML, where *should* these concerns be discussed?