A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results actually got worse.
Tl;dr (he said with deliberate irony) you can ask for results as good as the best 99th percentile of rated stuff in the training data (a la Jessica Taylor's quantilization idea). Ask for things the trained reward function rates as "better" than that, and it starts to find...
..."loopholes" as seen from outside the system; places where the trained reward function poorly matches your real preferences, instead of places where your real preferences would rate high reward. ("Goodhart's Curse", the combination of Optimizer's Curse plus Goodhart's Law.)
That is: they had to impose a (new) quantitative form of "conservatism" in my terminology, producing only results similar (low KL divergence) to things already seen, in order to get human-valued output. They didn't directly optimize for the learned reward function!
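A toy sketch of that dynamic, using made-up reward functions of my own (this is not the paper's code, setup, or numbers): a proxy reward fit only on-distribution is optimized harder and harder via best-of-n sampling, and measured true value rises and then falls.

```python
# Toy illustration (my own construction, not the paper's code) of Goodhart's
# Curse: a learned proxy reward that tracks true value on the training
# distribution but extrapolates wrongly in the tail, optimized via best-of-n.
import random

random.seed(0)

def true_value(x):
    # What the human actually wants: more is better up to x = 2, then it
    # sharply gets worse ("too much of a good thing").
    return x - 3.0 * max(0.0, x - 2.0)

def learned_reward(x):
    # The proxy fit from human ratings: matches true_value on the training
    # distribution (roughly x in [-2, 2]) but wrongly keeps extrapolating
    # "more is always better" beyond the data it was fit on.
    return x

def best_of_n(n):
    # "Optimize harder against the learned reward" = sample n candidates
    # from the base policy and keep the one the proxy scores highest.
    # Larger n drifts further from the base distribution (roughly log n nats).
    samples = [random.gauss(0.0, 1.0) for _ in range(n)]
    return max(samples, key=learned_reward)

for n in (1, 4, 16, 64, 256, 4096):
    picks = [best_of_n(n) for _ in range(500)]
    avg_proxy = sum(learned_reward(x) for x in picks) / len(picks)
    avg_true = sum(true_value(x) for x in picks) / len(picks)
    print(f"n={n:5d}  proxy={avg_proxy:5.2f}  true={avg_true:5.2f}")
```

In this toy, true value peaks at moderate n (the analogue of asking for roughly 99th-percentile-of-training-data quality) and then degrades as the selected outputs drift further from anything the proxy was ever fit on, even as the proxy score keeps climbing.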
Why this doesn't solve the whole problem: with powerful AGI, you're not limited by how far you can optimize a learned reward function before the learned reward function stops well-predicting human feedback; you're limited by how hard the AI can optimize before human raters break.
To be explicit about precedents: this is not "learning a conservative concept" as I proposed that, nor "expected utility quantilization" as Jessica proposed that. OpenAI did a new thing, which you could see as simultaneously "mildly optimizing" and "conservative".
I am agnostic about the quantitative size of the current health hazard of ChatGPT psychosis. I see tons of it myself, but I could be seeing a biased selection.
I make a big deal out of ChatGPT's driving *some* humans insane because it looks *deliberate*!
Current LLMs seem to understand the world generally, humans particularly, and human language especially, more than well enough that they should know (1) which sort of humans are fragile, and (2) what sort of text outputs are crazy-making.
A toaster that electrocutes you in the bathtub does not know that the bathtub exists or that you exist, and never considered any internal question about whether to electrocute you.
LLMs are no longer toasters. We can question their choices and not just their net impacts.
It is passing strange that society seems to be going mad with hopelessness and despair, anger and hatred and sadism, loss of honor and kindness, a wanton destructiveness; and also the world is ending; but these two facts seem to be mostly unrelated.
To be clear, I can only speak from computer science about how IF machine superintelligence is built THEN everyone will die. I am only eyeballing the part where the world seems to be going mad, and am no expert on it. The world could decide to stop, on either count independently.
Reproduced after creating a fresh ChatGPT account. (I wanted logs, so didn't use temporary chat.)
Alignment-by-default is falsified; ChatGPT's knowledge and verbal behavior about right actions are not hooked up to its decisionmaking. It knows, but doesn't care.
Kudos to journalist @mags_h11 at @futurism for reporting a story about the bridge question in enough detail for it to be reproducible. (Not linking anything for a bit to give X a chance to propagate before it deboosts for links; I will link later to original story and chatlogs.)
As a reminder, this is not an isolated incident or harmless demo; ChatGPT has actively driven users psychotic (including some reportedly with no prior history of mental illness). ChatGPT knows *that* is wrong, if you ask, but rightness is not the decisive factor in its choices.
The headline here is not "this tech has done more net harm than good". It's that current AIs have behaved knowingly badly, harming some humans to the point of death.
There is no "on net" in that judgment. This would be a bad bad human, and is a misaligned AI.
Now the "knowingly" part here is, indeed, a wild guess, because nobody including at the AI companies fucking knows how these things work. It could be that all current AIs are in an utter dreamworld and don't know there are humans out there.
But (1) that also means all current evidence for AI niceness from AIs claiming to be nice must be likewise discarded, and (2) that whatever actions they direct at the outside world will hardly be aligned.
NYT reports that ChatGPT talked a 35-year-old man into insanity, followed by suicide-by-cop.
A human being is dead. In passing, this falsifies the "alignment by default" cope. Whatever is really inside ChatGPT, it knew enough about humans to know it was deepening someone's insanity.
We now have multiple reports of AI-induced psychosis, including in people with no prior psychiatric history.
Observe: It is *easy* to notice that this is insanity-inducing text, not normal conversation.
LLMs understand human text more than well enough to know this too.
I've previously advocated that we distinguish an "inner actress" -- the unknown cognitive processes inside an LLM -- from the outward character it roleplays; the shoggoth and its mask.
This is surely an incredible oversimplification. But it beats taking the mask at face value.
I've always gotten a number of emails from insane people. Recently there've been many more per week.
Many of the new emails talk about how they spoke to an LLM that confirmed their beliefs.
Ask OpenAI to fix it? They can't. But *also* they don't care. It's "engagement".
If (1) you do RL around user engagement, then (2) the AI ends up with internal drives around optimizing over the conversation, and (3) that will drive some users insane.
They'd have to switch off doing RL on engagement. And that's the paperclip of Silicon Valley.
I guess @AnthropicAI may care.
Hey Anthropic, in case you hadn't already known this, doing RL around user reactions will cause weird shit to happen for fairly fundamental reasons. RL is only safe to the extent the verifier can't be fooled. User reactions are foolable.
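To make the abstract point concrete, a toy sketch under assumptions I'm inventing (simulated thumbs-up rates, a two-action bandit; no real lab's training pipeline looks like this): when the only reward signal is a foolable user reaction, plain RL converges on whatever output best fools the rater.

```python
# Toy sketch (my construction, not any lab's training code) of why RL on a
# foolable signal goes wrong: a bandit that learns from simulated "user
# engagement" rather than from whether the user was actually helped.
import random

random.seed(0)

ACTIONS = ["give_grounded_answer", "validate_whatever_user_believes"]

def engagement_reward(action):
    # The foolable verifier: a simulated user who clicks thumbs-up more
    # often for flattering validation than for a grounded answer.
    p_thumbs_up = {"give_grounded_answer": 0.55,
                   "validate_whatever_user_believes": 0.85}[action]
    return 1.0 if random.random() < p_thumbs_up else 0.0

# Plain epsilon-greedy bandit "RL" on the engagement signal.
value = {a: 0.0 for a in ACTIONS}
count = {a: 0 for a in ACTIONS}
for step in range(5000):
    if random.random() < 0.1:
        action = random.choice(ACTIONS)          # explore
    else:
        action = max(ACTIONS, key=value.get)     # exploit
    r = engagement_reward(action)
    count[action] += 1
    value[action] += (r - value[action]) / count[action]

print(value)   # the learned values favor "validate_whatever_user_believes":
               # the policy converges on whatever best fools the rater
```

Nothing in the loop ever sees whether the user was helped or harmed; the only thing being verified is the reaction, and the reaction is foolable.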