Jason Wei
ai researcher @openai
May 3, 2023
Since GPT-4, some have argued that emergence in LLMs is overstated, or even a "mirage". I don't think these arguments debunk emergence, but they warrant discussion (it's generally good to examine scientific phenomena critically).

A blog post: jasonwei.net/blog/common-ar…

🧵⬇️

Argument 1: Emergence occurs for "hard" evaluation metrics like exact match or multiple choice, and if you use metrics that award partial credit, then performance improves smoothly (arxiv.org/abs/2304.15004).
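To make the metric argument concrete, here is a minimal Python sketch (the task, target, and model outputs are invented for illustration, not taken from the paper): the same sequence of predictions looks like a sudden jump under exact match but improves smoothly under a per-token partial-credit metric.

```python
# Score the same hypothetical predictions two ways.
# Exact match is all-or-nothing; per-token accuracy gives partial credit.

def exact_match(pred: str, target: str) -> float:
    return float(pred == target)

def per_token_accuracy(pred: str, target: str) -> float:
    pred_toks, target_toks = pred.split(), target.split()
    correct = sum(p == t for p, t in zip(pred_toks, target_toks))
    return correct / max(len(target_toks), 1)

target = "the answer is 42"
# Invented outputs from hypothetical models of increasing scale.
preds_by_scale = [
    "a answer was 7",    # smallest model: 1/4 tokens correct
    "the answer was 7",  # 2/4 tokens correct
    "the answer is 7",   # 3/4 tokens correct
    "the answer is 42",  # largest model: all tokens correct
]

for pred in preds_by_scale:
    print(f"EM={exact_match(pred, target):.0f}  "
          f"token_acc={per_token_accuracy(pred, target):.2f}")
# EM reads 0, 0, 0, 1 (looks emergent); token accuracy reads
# 0.25, 0.50, 0.75, 1.00 (looks smooth).
```

Whether partial-credit metrics are the right thing to measure is, of course, part of the debate.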
Mar 16, 2023
I’m hearing chatter of PhD students not knowing what to work on.
My take: as LLMs are deployed IRL, the importance of studying how to use them will increase.
Some good directions IMO (no training):
1. prompting
2. evals
3. LM interfaces
4. safety
5. understanding LMs
6. emergence

1. Prompting research. Maybe a hot take, but I think we've just reached the tip of the iceberg on the best ways to prompt language models. As language model capabilities increase, the degrees of freedom for guiding a particular generation via a good prompt will increase.
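As a tiny illustration of those degrees of freedom, here is a hedged sketch (the question and prompt wording are invented) contrasting a direct prompt with one that elicits intermediate reasoning before the answer:

```python
question = "A train leaves at 3pm and the trip takes 2.5 hours. When does it arrive?"

# Direct prompt: ask for the answer immediately.
direct_prompt = f"Q: {question}\nA:"

# Reasoning-first prompt: same task, but the prompt steers the model
# to generate intermediate steps before the final answer.
reasoning_prompt = f"Q: {question}\nA: Let's think step by step."

# Either string goes to the same completion endpoint; only the prompt
# differs, yet on many reasoning tasks that difference substantially
# changes what a large model generates.
print(direct_prompt)
print(reasoning_prompt)
```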
Mar 13, 2023
Hot take supported by evidence: for a given NLP task, it is unwise to extrapolate performance to larger models because emergence can occur.

I manually examined all 202 tasks in BIG-Bench, and the most common category was scaling behavior that *unpredictably* increases. So the claim that emergent/unpredictable scaling behavior is "cherrypicked" is simply untrue.

However, it is true that loss on a broad test set or aggregate performance on BIG-Bench can improve predictably. But for a single downstream task this is simply not the case.
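For a sense of what that per-task categorization can look like, here is a minimal sketch (the category names, thresholds, and curves are illustrative assumptions, not the exact criteria used in the analysis):

```python
# Classify a task's scaling curve from accuracies at increasing model scales.
# Thresholds and labels below are illustrative assumptions.

def categorize_scaling(accuracies: list[float],
                       flat_tol: float = 0.05,
                       jump_tol: float = 0.30) -> str:
    """Label a scaling curve as flat, smooth, or emergent."""
    total_gain = accuracies[-1] - accuracies[0]
    if total_gain < flat_tol:
        return "flat"  # no meaningful improvement with scale
    # Largest single jump between adjacent model scales.
    max_jump = max(b - a for a, b in zip(accuracies, accuracies[1:]))
    if max_jump > jump_tol:
        return "emergent"  # most of the gain arrives suddenly
    return "smooth"  # gradual improvement you could extrapolate

# Invented accuracy curves across four model scales.
print(categorize_scaling([0.10, 0.11, 0.10, 0.12]))  # -> flat
print(categorize_scaling([0.10, 0.25, 0.45, 0.70]))  # -> smooth
print(categorize_scaling([0.10, 0.12, 0.11, 0.65]))  # -> emergent
```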
Feb 9, 2023
Studying emergent abilities of language models can seem out of reach for researchers without access to Google/DeepMind models.

A 🧵 with some unexplored ideas to study emergence using (1) the free codex API, (2) flan-t5, or (3) big-bench paper analysis.

(1) Many don't know, but the code-* API is free, and you can run three sizes of models: code-cushman-001, code-davinci-001, and code-davinci-002. code-davinci-002 is comparable with PaLM.

To get more model scales, small models such as text-ada-001 or text-curie-001 can be evaluated relatively cheaply.
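A minimal sketch of running such a cross-scale eval with the completion endpoint of the openai Python library from that era (the API key, prompt, and scoring are placeholders, and this 0.x-style interface has since changed, so treat it as a sketch rather than current usage):

```python
import openai  # 0.x-era interface assumed

openai.api_key = "YOUR_API_KEY"  # placeholder

# Hypothetical few-shot arithmetic probe.
PROMPT = "Q: 17 + 25\nA: 42\nQ: 38 + 14\nA:"
TARGET = "52"

# Smallest to largest of the code-* models mentioned above.
MODELS = ["code-cushman-001", "code-davinci-001", "code-davinci-002"]

for model in MODELS:
    resp = openai.Completion.create(
        model=model,
        prompt=PROMPT,
        max_tokens=5,
        temperature=0,  # greedy decoding for reproducible evals
    )
    answer = resp["choices"][0]["text"].strip()
    print(f"{model}: {answer!r}  exact_match={answer == TARGET}")

# Repeat over a full task and plot exact match against model scale
# to look for a sudden, emergent jump.
```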
Jan 25, 2023
Yesterday I gave a lecture at @Stanford's CS25 class on Transformers!

The lecture was on how “emergent abilities” are unlocked by scaling up language models. Emergence is one of the most exciting phenomena in large LMs…

Slides: docs.google.com/presentation/d…

Throughout the past year, hundreds of emergent abilities have been observed, abilities that appear only in large-enough language models. I previously made a list of more than 100 of them:

Jun 12, 2022
Re-sharing a consciousness piece I wrote a while back:

If language models can generate a "stream of consciousness" indistinguishable from a human's, why aren't they conscious?

jasonwei20.github.io/files/artifici…

1. I first argue that language models like GPT-3 can generate a stream of thought similar to the cascade of thoughts that seems to arise in our minds.