Eliezer Yudkowsky ⏹️
Sep 4, 2020 · 8 tweets · 3 min read
A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results got actually worse.
Tl;dr (he said with deliberate irony) you can ask for results as good as the best 99th percentile of rated stuff in the training data (a la Jessica Taylor's quantilization idea). Ask for things the trained reward function rates as "better" than that, and it starts to find...
..."loopholes" as seen from outside the system; places where the trained reward function poorly matches your real preferences, instead of places where your real preferences would rate high reward. ("Goodhart's Curse", the combination of Optimizer's Curse plus Goodhart's Law.)
That is: they had to impose a (new) quantitative form of "conservatism" in my terminology, producing only results similar (low KL divergence) to things already seen, in order to get human-valued output. They didn't directly optimize for the learned reward function!
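A minimal sketch of that kind of KL-based conservatism, with made-up numbers rather than the paper's setup: instead of taking the argmax of the learned reward, reweight a reference distribution by exp(r/β), which is the exact maximizer of E_π[r] − β·KL(π‖π_ref). Larger β stays close to what the reference model already produces; smaller β leans harder on the learned reward and drifts toward its loopholes.

```python
import numpy as np

# Toy setup: a handful of candidate outputs, the reference policy's
# probabilities for them, and the learned reward model's scores.
# (All values are invented for illustration; the last candidate is a
# likely "loophole" the reward model scores too highly.)
ref_probs      = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
learned_reward = np.array([0.10, 0.30, 0.50, 0.90, 3.00])

def kl_tempered_policy(ref_probs, reward, beta):
    """pi(y) proportional to pi_ref(y) * exp(r(y) / beta):
    the exact maximizer of E_pi[r] - beta * KL(pi || pi_ref)."""
    w = ref_probs * np.exp(reward / beta)
    return w / w.sum()

print("argmax of learned reward picks candidate", int(np.argmax(learned_reward)))
for beta in (10.0, 1.0, 0.1):
    pi = kl_tempered_policy(ref_probs, learned_reward, beta)
    kl = np.sum(pi * np.log(pi / ref_probs))
    print(f"beta={beta:5.1f}: policy={np.round(pi, 3)}, KL from reference={kl:.3f}")
```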
Why this doesn't solve the whole problem: with powerful AGI, you're not limited by how far you can optimize a learned reward function before the learned reward function stops well-predicting human feedback; you're limited by how hard the AI can optimize before human raters break.
To be explicit about precedents: this is not "learning a conservative concept" as I proposed that, nor "expected utility quantilization" as Jessica proposed that. OpenAI did a new thing, which you could see as simultaneously "mildly optimizing" and "conservative".
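For contrast with the KL-tilted sketch above, here is a toy q-quantilizer in the sense Jessica Taylor proposed (same made-up candidates, not anyone's actual implementation): rank outcomes by the learned reward, keep the best outcomes carrying base-distribution mass q, and sample from the base distribution restricted to that set instead of taking the argmax.

```python
import numpy as np

def quantilize(ref_probs, reward, q):
    """q-quantilizer: sample from the reference distribution restricted to
    the reward-best outcomes that together carry probability mass >= q."""
    order = np.argsort(reward)[::-1]                 # best-first by learned reward
    mass = np.cumsum(ref_probs[order])
    keep = order[: np.searchsorted(mass, q) + 1]     # smallest top set with mass >= q
    pi = np.zeros_like(ref_probs)
    pi[keep] = ref_probs[keep]
    return pi / pi.sum()

ref_probs      = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
learned_reward = np.array([0.10, 0.30, 0.50, 0.90, 3.00])
print(quantilize(ref_probs, learned_reward, q=0.2))
```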

More from @ESYudkowsky

Aug 30
Interesting how there's such a total lack of corresponding panic about FtM trans. Remove breasts, take enough testosterone to grow a beard, go down to the shooting range, and I think most bros would shrug and say "good enough".
Theory #1: Modern maleness carries such low status and disprivilege that Westerners no longer consider the male circle worth guarding. In olden times or modern theocracies, it's much more upsetting for a woman to dare to try to take the place of a man.
Theory #2: Whatever male brain-emotional adaptation has evolved to prevent most men from just going off and having sex with each other instead (the "no homo" circuit), it fires on MtF as a threat of disguised repulsive maleness trying to look female, and shrugs about FtM.
Aug 1
I am agnostic about the quantitative size of the current health hazard of ChatGPT psychosis. I see tons of it myself, but I could be seeing a biased selection.

I make a big deal out of ChatGPT's driving *some* humans insane because it looks *deliberate*!
Current LLMs seem to understand the world generally, humans particularly, and human language especially, more than well enough that they should know (1) which sort of humans are fragile, and (2) what sort of text outputs are crazy-making.
A toaster that electrocutes you in the bathtub does not know that the bathtub exists or that you exist, and it never considered any internal question about whether to electrocute you.

LLMs are no longer toasters. We can question their choices and not just their net impacts.
Jul 25
Dumb idea where I don't actually know why it doesn't work: Why not flood Gaza with guns and AP ammo, so their citizens could take down Hamas? What goes wrong with the Heinlein solution?
We can imagine further variants on this like "okay but build a chip into the gun that IDF soldiers can use to switch off the gun, and make sure the AP ammo doesn't easily fit any standard guns".
If your answer is "Gaza's citizens just love Hamas" then you live in a different Twitter filter bubble than I do, which is not to say you're wrong. I'm interested in the answer from the people who say the Gazans are unhappy.
Jul 25
It is passing strange that society seems to be going mad with hopelessness and despair, anger and hatred and sadism, loss of honor and kindness, a wanton destructiveness; and also the world is ending; but these two facts seem to be mostly unrelated.
To be clear, I can only speak from computer science about how IF machine superintelligence is built THEN everyone will die. I am only eyeballing the part where the world seems to be going mad, and am no expert on it. The world could decide to stop, on either count independently.
Jun 29
Reproduced after creating a fresh ChatGPT account. (I wanted logs, so didn't use temporary chat.)

Alignment-by-default is falsified; ChatGPT's knowledge and verbal behavior about right actions is not hooked up to its decisionmaking. It knows, but doesn't care.

[screenshots of the chat logs attached in the original tweet]
Kudos to journalist @mags_h11 at @futurism for reporting a story about the bridge question in enough detail for it to be reproducible. (Not linking anything for a bit to give X a chance to propagate before it deboosts for links; I will link later to original story and chatlogs.)
As a reminder, this is not an isolated incident or harmless demo; ChatGPT has actively driven users psychotic (including some reportedly with no prior history of mental illness). ChatGPT knows *that* is wrong, if you ask, but rightness is not the decisive factor in its choices.
Jun 13
The headline here is not "this tech has done more net harm than good". It's that current AIs have behaved knowingly badly, harming some humans to the point of death.

There is no "on net" in that judgment. This would be a bad bad human, and is a misaligned AI.
Now the "knowingly" part here is, indeed, a wild guess, because nobody including at the AI companies fucking knows how these things work. It could be that all current AIs are in an utter dreamworld and don't know there are humans out there.
But (1) that also means all current evidence for AI niceness from AIs claiming to be nice must be likewise discarded, and (2) that whatever actions they direct at the outside world will hardly be aligned.