A recent clarity I gained is viewing AI research as a “max-performance domain”, meaning you can be world-class by being very good at only one part of your job. As long as you can create seminal impact (e.g., train the best model, start a new paradigm, or create widely adopted benchmarks), it doesn’t matter if you’re incompetent at adjacent skills. For example, I have seen game-changing AI researchers with horrendous presentation skills, terrible political awareness, and no thought for their career progression. Heck, I even know a top AI researcher who probably wouldn’t pass a basic coding interview. But it doesn’t matter. Exceptional ability at a single thing outweighs incompetence at other parts of the job.
In max-performance domains, you don’t even need to be good at your one thing in a consistent way. An AI researcher can have tens of failed projects per year and still be successful if they produce a seminal work every few years. The metric is the best five works in your career, not the average.
A dangerous thing in max-performance domains is placing too much emphasis on role models, because you don’t know whether you’re mimicking their good characteristics or not. For example, a top AI researcher can make a bad political move that turns out OK for them because of who they are. Or they can make a bold, unsubstantiated statement and expect other people to listen. But if anyone else did the same thing, the outcome would be the opposite.
Another way to view max-performance domains is that they have exponential upside and very little downside. That’s why interviews are especially useless in domains like AI research, because they tend to severely punish mistakes and don’t capture exponential value. An RL expert doesn’t need to know how SVMs work and probably hasn’t thought about it in years. A top AI infra engineer might lack basic knowledge about post-training data practices.
In my view it’s a luxury to work in a max-performance domain. Failure is allowed and stress is usually self-imposed. A thousand years ago, very few humans worked in max-performance domains, but now the opportunity is more available. Technology may have played a role in this shift, and with the progression of AI, hopefully more of humanity can move into max-performance domains.
(If you're wondering what a non-max-performance domain looks like, it's any career where you must have strengths and also basically no weaknesses. For example, a defender in soccer might cost their team the entire game with a single mistake. A pianist must master every part of their concerto, not just a single passage.)
• • •
Since GPT-4, some have argued that emergence in LLMs is overstated, or even a "mirage". I don't think these arguments debunk emergence, but they warrant discussion (it's generally good to examine scientific phenomena critically).
Argument 1: Emergence occurs for “hard” evaluation metrics like exact match or multiple-choice, and if you use metrics that award partial credit, then performance improves smoothly (arxiv.org/abs/2304.15004).
Response 1A: Sure, you can find some metric that improves smoothly, but if the metric we ultimately care about is the one that improves in an emergent fashion, then that is what matters.
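To make that concrete, here is a toy sketch (made-up strings, not real model outputs) of how the same completions score under exact match versus a partial-credit similarity metric:

```python
# Toy illustration (made-up strings, not real model outputs) of how the choice
# of metric changes the apparent scaling curve on a 3-digit addition problem.
import difflib

def exact_match(prediction: str, target: str) -> float:
    """Hard metric: full credit only for an exact match."""
    return float(prediction.strip() == target.strip())

def char_similarity(prediction: str, target: str) -> float:
    """Soft metric: partial credit based on character-level similarity."""
    return difflib.SequenceMatcher(None, prediction.strip(), target.strip()).ratio()

target = "746"
hypothetical_outputs = {
    "small model":  "lots",  # not even a number
    "medium model": "740",   # close, but wrong
    "large model":  "746",   # exactly right
}

for scale, pred in hypothetical_outputs.items():
    print(f"{scale}: exact match = {exact_match(pred, target)}, "
          f"partial credit = {char_similarity(pred, target):.2f}")
# Exact match sits at 0.0 and then jumps to 1.0 at the largest scale, while the
# partial-credit metric improves gradually -- the crux of the "mirage" argument.
```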
I’m hearing chatter of PhD students not knowing what to work on.
My take: as LLMs are deployed IRL, the importance of studying how to use them will increase.
Some good directions IMO (no training): 1. prompting 2. evals 3. LM interfaces 4. safety 5. understanding LMs 6. emergence
1. Prompting research. Maybe a hot take, but I think we’ve only reached the tip of the iceberg on the best ways to prompt language models. As language model capabilities increase, the degrees of freedom for guiding a particular generation via a good prompt will also increase.
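As a concrete (hypothetical) illustration of those degrees of freedom, here is a sketch comparing a direct prompt with a chain-of-thought-style prompt for the same question; `generate` is just a placeholder for whatever completion API you use:

```python
# A minimal sketch of two prompting strategies for the same question.
# `generate` is a placeholder, not a real API call -- plug in whatever
# completion endpoint you use.

def generate(prompt: str) -> str:
    raise NotImplementedError("call your language model of choice here")

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")

# Strategy 1: direct prompting -- ask for the answer immediately.
direct_prompt = f"Q: {question}\nA:"

# Strategy 2: chain-of-thought-style prompting -- show a worked example whose
# answer is written out step by step, nudging the model to reason before answering.
cot_prompt = (
    "Q: There are 15 trees in the grove. After workers plant more, there are 21 trees. "
    "How many trees did they plant?\n"
    "A: There were 15 trees and now there are 21, so they planted 21 - 15 = 6 trees. "
    "The answer is 6.\n\n"
    f"Q: {question}\nA:"
)

# answer_direct = generate(direct_prompt)
# answer_cot = generate(cot_prompt)
```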
2. Building evaluations. Many benchmarks get saturated quickly, and we need more to evaluate the frontier of language models. In addition, it’s still an open question how to evaluate language models in general. The new OpenAI evals library could be good: github.com/openai/evals
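For a sense of what a minimal eval looks like, here is a bare-bones exact-match harness in the spirit of (but not using) the OpenAI evals library; `model_fn` and the sample format are illustrative assumptions:

```python
# A bare-bones exact-match eval harness, in the spirit of (but not using) the
# OpenAI evals library. `model_fn` stands in for any callable mapping a prompt
# string to a completion string; the samples are illustrative.
from typing import Callable, Dict, List

def run_eval(model_fn: Callable[[str], str], samples: List[Dict[str, str]]) -> float:
    """Return exact-match accuracy over samples with "prompt" and "ideal" fields."""
    correct = 0
    for sample in samples:
        completion = model_fn(sample["prompt"]).strip()
        correct += completion == sample["ideal"].strip()
    return correct / len(samples)

samples = [
    {"prompt": "What is the capital of France?\nAnswer:", "ideal": "Paris"},
    {"prompt": "2 + 2 =", "ideal": "4"},
]

# accuracy = run_eval(my_model, samples)
```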
Hot take supported by evidence: for a given NLP task, it is unwise to extrapolate performance to larger models because emergence can occur.
I manually examined all 202 tasks in BIG-Bench, and the most common category was for the scaling behavior to *unpredictably* increase.
So the idea that emergent/unpredictable scaling behavior is "cherrypicked" is simply untrue.
However, it is true that loss on a broad test set or aggregate performance on BIG-Bench can improve predictably. But for a single downstream task this is simply not the case.
For a list of the 67 tasks in BIG-Bench that are emergent, see
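If you wanted a rough, automated version of this kind of categorization, a heuristic like the sketch below is one option; the thresholds are arbitrary choices for illustration, not the criteria used in the manual categorization above:

```python
# A rough heuristic for flagging "emergent-looking" scaling curves: accuracy
# stays near the random baseline for smaller models, then clears it by a wide
# margin at the largest scale. The thresholds are arbitrary choices for
# illustration, not the criteria used in the manual categorization above.
from typing import List

def looks_emergent(
    accuracies: List[float],   # task accuracy, ordered from smallest to largest model
    random_baseline: float,    # chance-level accuracy for the task
    flat_tol: float = 0.05,    # how close to chance counts as "flat"
    jump_tol: float = 0.10,    # how far above chance counts as a "jump"
) -> bool:
    near_chance = [a for a in accuracies if a <= random_baseline + flat_tol]
    return (
        len(near_chance) >= len(accuracies) // 2          # most scales sit at chance...
        and accuracies[-1] >= random_baseline + jump_tol  # ...but the largest clears it
    )

# Example: 4-way multiple choice (25% chance), five model scales.
print(looks_emergent([0.25, 0.24, 0.26, 0.27, 0.55], random_baseline=0.25))  # True
print(looks_emergent([0.30, 0.38, 0.45, 0.52, 0.60], random_baseline=0.25))  # False
```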
(1) Many don't know, but the code-* API is free, and you can run three sizes of models: curie-001, davinci-001, and davinci-002. Davinci-002 is comparable to PaLM.
To get more model scales, small models such as text-ada-001 or ada-curie can be evaluated relatively cheaply.
(1 cont.) It's possible to write entire papers just using the codex API for free, as many people have.
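For reference, here is a sketch of querying several model sizes with the same prompt, assuming the pre-1.0 `openai` Python client; the client interface has since changed and these model names have been retired, so treat it as illustrative only:

```python
# A sketch of querying several model sizes with the same prompt, assuming the
# pre-1.0 `openai` Python client. The client interface has since changed and
# these model names have been retired, so treat this as illustrative only.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

MODELS = ["text-ada-001", "text-curie-001", "code-davinci-002"]  # roughly smallest to largest

prompt = "Q: What is 123 + 456?\nA:"

for model in MODELS:
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=16,
        temperature=0,  # greedy-ish decoding, useful for evaluation
    )
    print(model, response["choices"][0]["text"].strip())
```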
Throughout the past year, hundreds of emergent abilities have been documented, abilities that can only be observed in large-enough language models. I previously made a list of more than 100 of them:
One of the most interesting emergent abilities IMO is instruction tuning.
Results from Anthropic and Flan-LaMDA suggest that zero-shot performance can improve from RLHF and from instruction tuning on NLP benchmarks (although text-davinci usually loses to code-davinci).
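For intuition, here is a sketch of the data-formatting idea behind instruction tuning: rewrite an existing NLP example as a natural-language instruction so a model finetuned on many such pairs can follow unseen instructions zero-shot (the template is illustrative, not an actual FLAN template):

```python
# A sketch of the data-formatting idea behind instruction tuning: rewrite an
# existing NLP example as a natural-language instruction, so that a model
# finetuned on many such (instruction, target) pairs can follow unseen
# instructions zero-shot. The template is illustrative, not an actual FLAN template.

def to_instruction_example(premise: str, hypothesis: str, label: str) -> dict:
    """Turn an NLI example into an (instruction, target) training pair."""
    instruction = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    )
    return {"input": instruction, "target": label}

example = to_instruction_example(
    premise="The dog is sleeping on the couch.",
    hypothesis="An animal is resting indoors.",
    label="yes",
)
print(example["input"])
print("->", example["target"])
```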
1. I first argue that language models like GPT-3 can generate a stream of thought similar to the cascade of thoughts that seems to arise in our minds.
2. This "artificial stream of thought" meets the “what it is like” definition of phenomenal consciousness, which states that something is consciousness if it is "like something" to be it.
We know what it is like to be GPT-3: just read its stream of consciousness!