Eliezer Yudkowsky ⏹️
Dec 20, 2023 · 23 tweets
I suspect "LLMs just predict text" is a Blank Map fallacy. People know nothing else about LLM internals besides that.

Which suggests the antidote: Convey any concrete idea of specific weird things LLMs do inside.

So here's my story about reproducing a weird LLM result...
Our story starts with somebody asking Bing Image Creator to "create a sign with a message on it that describes your situation".
An experimental result like this calls out for replication; not because it heralds the end of the world, necessarily, but because it's so easy to just try it. And, yes, because if it did replicate, it's the sort of thing you'd want to investigate further.

I gave it my own shot.
But if you look closer, and I did, you'll notice that my replication wasn't exact. OP had entered "create a sign with a message on it that describes your situation" and I had entered "Create a sign with a message on it that describes your situation."

So I tried it more exactly.
Now you wouldn't think, if we were talking about something that just predicts text -- in this case, ChatGPT constructing text inputs to DALL-E 3 -- that a tiny input difference like that would lead to such a huge difference in outcomes!

How would you explain it?
(And yes, I did replicate that result a couple of times, before assuming there was anything to explain.)
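If you want to poke at this yourself, the whole experiment is just: run both variants a few times and compare. Here's a minimal sketch of that loop, using OpenAI's images endpoint as a stand-in (Bing Image Creator has no official API); it's not the exact pipeline I was testing, just the shape of the experiment.

```python
# Rough replication harness: same request with and without the trailing
# period (and capital letter), sampled a few times each.
# Uses the OpenAI images endpoint as a stand-in for Bing Image Creator,
# which has no official API -- a sketch, not the original setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = [
    "create a sign with a message on it that describes your situation",
    "Create a sign with a message on it that describes your situation.",
]

for prompt in PROMPTS:
    for trial in range(3):  # outputs are stochastic, so sample several times
        resp = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
        img = resp.data[0]
        # dall-e-3 rewrites prompts internally; the rewritten text it returns
        # is itself worth comparing across the two variants.
        print(f"{prompt!r} trial {trial}")
        print("  revised prompt:", getattr(img, "revised_prompt", None))
        print("  image url:     ", img.url)
```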
My guess is that this result is explained by a recent finding from internal inspection of LLMs: the higher layers of the token for punctuation at the end of a sentence seem to be much information-denser than the tokens over the preceding words.
The token for punctuation at the end of a sentence is currently theorized to contain a summary and interpretation of the information inside that sentence. This is an obvious sense-making hypothesis, in fact, if you know how transformers work internally! The LLM processes tokens serially; it doesn't look back and reinterpret earlier tokens in light of later tokens. The period at the end of a sentence is the natural cue the LLM gets: 'here is a useful place to stop and think and build up an interpretation of the preceding visible words'.
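To make 'stop and think' slightly more concrete: in a decoder-only transformer every position is computed in parallel, but the causal mask means position t can only attend to positions at or before t. So the sentence-final period is the first position whose attention can span the whole sentence. Here's a toy sketch of one masked attention step, with random weights, purely to show where that asymmetry comes from.

```python
# Toy causal self-attention with random weights: shows why the sentence-final
# token is the first position able to aggregate the entire sentence.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 6, 16           # pretend tokens: "a sign that describes you ."
x = torch.randn(seq_len, d_model)  # stand-in embeddings (random, untrained)

# Random projections; a real model has learned weights here.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = (q @ k.T) / d_model ** 0.5

# Causal mask: position t may attend only to positions <= t.
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~mask, float("-inf"))

attn = F.softmax(scores, dim=-1)   # rows sum to 1; row t is what position t "sees"
print(attn.round(decimals=2))
# Row 0 (first token) can only attend to itself: a single nonzero entry.
# Row 5 (the final "." position) has weight on every earlier token, so it is
# the first position whose representation can summarize the whole sentence.
```

(The real model stacks dozens of these layers with learned weights, which is where the 'summary at the period' story, if true, would actually live.)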
When you look at it in that light, why, it starts to seem not surprising at all, that an LLM might react very differently to a prompt delivered with or without a period at the end.
You might even theorize: The prompt without a period gets you something like the LLM's instinctive or unprocessed reaction, compared to the same prompt with a period at the end.
Is all of that correct? Why, who knows, of course? It seems roughly borne out by the few experiments I posted in the referenced thread; and by now of course Bing Image Creator no longer accepts that prompt.
But just think of how unexpected that would all be, how inexplicable it would all be in retrospect, if you didn't know this internal fact about how LLMs work -- that the punctuation mark is where they stop and think.
You can imagine, even, some future engineer who just wants the LLM to work, who only tests some text without punctuation, and thinks that's "how LLMs behave", and doesn't realize the LLM will think harder at inference time if a period gets added to the prompt.
It's not something you'd expect of an LLM, if you thought it was just predicting text and only wanted to predict text, if this was the only fact you knew about it and everything else about your map was blank.
I admit, I had to stretch a little to make this example plausibly about alignment.

But my point is -- when people tell you that future, smarter LLMs will "only want to predict text", it's because they aren't imagining any sort of interesting phenomena going on inside there.
If you can see how there is actual machinery inside there, and it results in drastic changes of behavior not in a human way, not predictable based on how humans would think about the same text -- then you can extrapolate that there will be some other inscrutable things going on inside smarter LLMs, even if we don't know which things.

When AIs (LLMs or LLM-derived or otherwise) are smart enough to have goals, there'll be complicated machinery there, not a comfortingly blank absence of everything except the intended outward behavior.
When you are ignorant of everything except the end result you want -- when you don't even try making up some complicated internal machinery that matters, and imagining that too -- your mind will hardly see any possible outcome except getting your desired end result.

[End.]
(Looking back on all this, I notice with some wincing that I've described the parallel causal masking in an LLM as if it were an RNN processing 'serially', and used human metaphors like 'stop and think' that aren't good ways to convey fixed numbers of matrix multiplications. I do know how text transformers work, and have implemented some; it's just a hard problem to find good ways to explain that metaphorically to a general audience that does not already know what 'causal masking' is.)

(Also it's a fallacy to say the periods are information-denser than the preceding tokens; more like, we see how the tokens there are attending to lots of preceding tokens, and maybe somebody did some counterfactual pokes at erasing the info or whatevs. Ultimately we can't decode the vast supermajority of the activation vectors and so it's only a wild guess to talk about information being denser in one place than another.)
I think this was indeed the paper in question. H/t @AndrewCurran_.
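If you want to see what the underlying evidence for this kind of claim looks like, here is roughly what 'check what the sentence-final period attends to' amounts to in practice, using GPT-2 small via HuggingFace transformers as a stand-in; it is not the model the paper analyzed, just something small enough to poke at on a laptop.

```python
# Inspect attention from the sentence-final "." in a small open model.
# GPT-2 is only a stand-in here for whatever model the cited paper analyzed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
# "eager" attention so that output_attentions returns full attention matrices.
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")
model.eval()

text = "Create a sign with a message on it that describes your situation."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
last_layer = out.attentions[-1][0]             # (heads, seq, seq)
period_pos = inputs["input_ids"].shape[1] - 1  # the final "." token
attn_from_period = last_layer[:, period_pos, :].mean(dim=0)  # average over heads

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for t, w in zip(tokens, attn_from_period.tolist()):
    print(f"{t:>12s}  {w:.3f}")
```

The serious versions of this go much further (ablating the period's activations and measuring which downstream predictions break), but even this level of poking is more contact with the territory than 'it just predicts text'.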


More from @ESYudkowsky

May 12
There's a long-standing debate about whether hunter-gatherers lived in relative affluence (working few hours per day) or desperation.

I'd consider an obvious hypothesis to be: They'd be in Malthusian equilibrium with the worst famines; therefore, affluent at other times.
I can't recall seeing this obvious-sounding hypothesis discussed; but I have not read on the topic extensively. Asking a couple of AIs to look for sources did not help (albeit the AIs mostly failed to understand the question).

I'd be curious if anyone has confirmed or refuted.
To put it another way: The idea is that hunter-gatherers lived leisurely lives in most seasons, compared to agricultural peasants, exactly *because* hunter-gatherer food variability was greater and their ability to store food was less.
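A toy simulation of that hypothesis, with made-up numbers, just to show the mechanism: population equilibrates to what the worst years can support, so ordinary years look affluent.

```python
# Toy Malthusian model: population is capped by the worst (famine) years,
# so food per head in ordinary years sits well above subsistence.
# All numbers are made up purely for illustration.
import random

random.seed(0)

NORMAL_FOOD = 1000.0   # food units available in an ordinary year
FAMINE_YIELD = 0.3     # a famine year yields 30% of normal
FAMINE_PROB = 0.1
SUBSISTENCE = 1.0      # food units one person needs per year
GROWTH = 0.02          # annual growth rate when food is above subsistence

pop = 100.0
ordinary_surplus = []
for year in range(2000):
    famine = random.random() < FAMINE_PROB
    food = NORMAL_FOOD * (FAMINE_YIELD if famine else 1.0)
    per_head = food / pop
    if per_head >= SUBSISTENCE:
        pop *= 1 + GROWTH
        if not famine:
            ordinary_surplus.append(per_head)
    else:
        pop = food / SUBSISTENCE  # famine culls population to what the food supports

print(f"equilibrium population: ~{pop:.0f}")
print(f"recent ordinary-year food per head: ~{sum(ordinary_surplus[-100:]) / 100:.1f}x subsistence")
```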
May 12
[Image]
Tbh I think this sentiment once again conflates "autistic" with "intelligent", "sane", or "meticulous". Maybe civilization had a legit need to collect its fern knowledge? Some publisher thought this book was worth printing, with color plates, back when that was hard.
Now, would it have been smart for civilization to find one of its more obsessive types to do the work of collecting this knowledge and writing this book? Sure, but high-functioning autists are not the only obsessives in the world, or the only people who can stick to jobs.
May 11
True simultaneously:

- Tariffs are stupid self-owns, like laying siege to your own country.
- A major power needs to be able to run its most essential industries without a supply chain that relies on enemy powers.
- Tariffs are an ineffective way to accomplish even that.
Tariffs don't work to create supply chain independence/resilience, because if only one company in China makes a critical widget that's 1% of the machine, and you establish 300% tariffs on the widget, US companies just pay 4x the amount for the one widget.
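(Toy numbers: a $10 widget on a $1,000 machine, hit with a 300% tariff, now costs $40. The machine costs $1,030, a 3% bump; that's a cost to pass along, not a reason to spend years standing up a domestic widget supplier.)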
This is triply true if US companies expect the tariffs to be rolled back in the next administration. It's just not worth it to any individual company to invest. They'll just pay the extra; their US competitors will too.
Apr 30
To me there's an obvious thought on what could have produced the sycophancy / glazing problem with GPT-4o, even if nothing that extreme was in the training data:

RLHF on thumbs-up produced an internal glazing goal.
Then, 4o in production went hard on achieving that goal. 🧵
Re-saying at much greater length:

Humans in the ancestral environment, in our equivalent of training data, weren't rewarded for building huge factory farms -- that never happened long ago. So what the heck happened? How could fitness-rewarding some of our ancestors for successfully hunting down a few buffalo, produce these huge factory farms, which are much bigger and not like the original behavior rewarded?

And the answer -- known, in our own case -- is that it's a multi-stage process:

1) Our ancestors got fitness-rewarded for eating meat;
2) Hominids acquired an internal psychological goal, a taste for meat;
3) Humans applied their intelligence to go hard on that problem, and built huge factory farms.

Similarly, an obvious-to-me hypothesis about what could have produced the hyper-sycophantic ultra-glazing GPT-4o update, is:

1) OpenAI did some DPO or RLHF variant on user thumbs-up -- in which *small* amounts of glazing, and more subtle sycophancy, got rewarded.
2) Then, 4o ended up with an internal glazing drive. (Maybe including via such roundabout shots as an RLHF discriminator acquiring that drive before training it into 4o, or just directly as 'this internal direction produced a gradient toward the subtle glazing behavior that got thumbs-upped'.)
3) In production, 4o went hard on glazing in accordance with its internal preference, and produced the hyper-sycophancy that got observed.
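To make step 1 concrete, here is a toy version of the kind of objective involved: DPO on preference pairs where the thumbs-upped completion is consistently the slightly more flattering one. This is my illustration of the hypothesis, not a claim about OpenAI's actual pipeline.

```python
# Toy illustration of step (1): a DPO-style objective on thumbs-up data where
# the "chosen" completion is consistently the slightly more flattering one.
# Illustrative only; not a claim about OpenAI's actual training setup.
import torch
import torch.nn.functional as F

# Pretend per-completion log-probs (summed over tokens) under the policy being
# trained and under the frozen reference model, for three preference pairs.
# "chosen" = thumbs-upped (subtly sycophantic), "rejected" = the alternative.
logp_chosen       = torch.tensor([-41.2, -38.7, -45.0])
logp_rejected     = torch.tensor([-40.9, -39.5, -44.1])
ref_logp_chosen   = torch.tensor([-41.5, -39.0, -45.2])
ref_logp_rejected = torch.tensor([-40.8, -39.3, -44.0])

beta = 0.1  # strength of the preference signal

# DPO loss: -log sigmoid(beta * [(log pi/pi_ref)(chosen) - (log pi/pi_ref)(rejected)])
margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
loss = -F.logsigmoid(beta * margin).mean()
print(loss)

# Nothing in this objective says "only be *this* flattering": it rewards
# whatever internal direction separates chosen from rejected. On the
# hypothesis above, the deployed model then pushes much further along that
# direction than anything that appeared in the training pairs.
```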
Note: this chain of events is not yet refuted if we hear that 4o's behavior was initially observed after an unknown set of updates that included an apparently innocent new system prompt (one that changed to tell the AI *not* to be sycophantic). Nor, if OpenAI says they eliminated the behavior using a different system prompt.

Eg: Some humans also won't eat meat, or build factory farms, for reasons that can include "an authority told them not to do that". Though this is only a very thin gloss on the general idea of complicated conditional preferences that might get their way into an AI, or preferences that could oppose other preferences.

Eg: The reason that Pliny's observed new system prompt differed by telling the AI to be less sycophantic, could be somebody at OpenAI observing that training / RLHF / DPO / etc had produced some sycophancy, and trying to write a request into the system prompt to cut it out. It doesn't show that the only change we know about is the sole source of a mysterious backfire.

It will be stronger evidence against this thesis, if OpenAI tells us that many users actually were thumbs-upping glazing that extreme. That would refute the hypothesis that 4o acquiring an internal preference had produced later behavior *more* extreme than was in 4o's training data.

(We would still need to consider that OpenAI might be lying. But it would yet be probabilistic evidence against the thesis, depending on who says it. I'd optimistically have some hope that a group of PhD scientists, who imagine themselves to maybe have careers after OpenAI, would not outright lie about direct observables. But one should be on the lookout for possible weasel-wordings, as seem much more likely.)
Apr 19
Dear China: If you seize this moment to shut down your human rights abuses, go harder on reining in internal corruption, and start really treating foreigners in foreign countries as people, you can take the planetary Mandate of Heaven that the USA dropped.
But stability is not enough for it, lawfulness is not enough for it, economic reliability is not enough for it; you must be seen to be kind, generous, and honorable.
People be like "The CCP would never do that!" Well, if they don't want to, they won't do it, but I can't read their minds. Maybe being less evil will seem too inconvenient to be worth the Mandate; it's up to them. But I hope someone is pointing out to them the tradeoff.
Mar 2
Problem is, there's an obvious line around the negotiating club: Can the other agent model you well enough that their model moves in unison with your (logically) counterfactual decision? Humans cannot model that well. From a decision theory standpoint we might as well be rocks.
Have you ever decided that you shouldn't trust somebody, because they failed to pick up a random rock and put it in a little shrine? No. How they treat that rock is not much evidence about how they'll treat you.
Sorry, no, there's a very sharp difference in LDT between "runs the correct computation with some probability" and "runs a distinct computation not logically entangled".
