@ShunyuYao12 @danfriedman0 @mdahardy @cocosci_lab In this thread, find a summary of the work & some extensions (yes, the results hold for OpenAI o1!)
And note that we've condensed it to 12 pages - making it a much quicker read than the 84-page preprint!
2/n
Our big question: How can we develop a holistic understanding of large language models (LLMs)?
One popular approach has been to evaluate them w/ tests made for humans
But LLMs are not humans! The tests that are most informative about them may be different from the ones that are most informative about us
3/n
So how can we evaluate LLMs on their own terms?
We argue for a *teleological approach*, which has been productive in cognitive science: understand systems via the problem they adapted to solve
For LLMs this is autoregression (next-word prediction) over Internet text
4/n
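As a toy illustration of what autoregression means here (my own sketch, not code from the paper): at its simplest, a next-word predictor is just a table of observed continuations.

```python
from collections import Counter, defaultdict

# Tiny stand-in for "Internet text" (illustrative corpus, my invention)
corpus = "the cat sat on the mat . the dog sat on the mat .".split()

# For each word, count which words follow it (a bigram model)
continuations = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    continuations[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation of `word` in the corpus."""
    return continuations[word].most_common(1)[0][0]
```

An LLM is vastly more sophisticated than this, but it is optimized for the same objective: predicting the next token given the preceding ones.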
By reasoning about next-word prediction, we make several hypotheses about factors that will cause difficulty for LLMs
1st is task frequency: we predict better performance on frequent tasks than rare ones, even when the tasks are equally complex
Eg, linear functions (see img)!
5/n
Another example: shift ciphers - decoding a message by shifting each letter N positions back in the alphabet.
On the Internet, the most common value for N is 13 (rot-13). Language models show a spike in accuracy at a shift of 13!
6/n
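If you haven't seen shift ciphers before, here's a minimal decoder (my own sketch; the function name is mine, not the paper's):

```python
def shift_decode(text, n):
    """Decode a shift cipher: move each letter n positions back in the alphabet."""
    out = []
    for ch in text:
        if ch.islower():
            out.append(chr((ord(ch) - ord('a') - n) % 26 + ord('a')))
        elif ch.isupper():
            out.append(chr((ord(ch) - ord('A') - n) % 26 + ord('A')))
        else:
            out.append(ch)  # leave spaces and punctuation untouched
    return "".join(out)

# rot-13 is its own inverse, since 13 + 13 = 26
print(shift_decode("Uryyb, jbeyq!", 13))  # Hello, world!
```

Every shift value defines an equally simple task; only rot-13 is common online, which is why the accuracy spike at N=13 is so telling.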
The 2nd factor we predict will influence LLM accuracy is output probability
Indeed, across many tasks, LLMs score better when the output is high-probability than when it is low-probability - even though the tasks are deterministic
E.g.: Reversing a list of words (see img)
7/n
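To make concrete why this is striking (a hypothetical example, not one from the paper): word-list reversal is the same deterministic operation whether the result happens to be a natural phrase or word salad.

```python
def reverse_words(words):
    """Reverse a list of words - deterministic regardless of output probability."""
    return list(reversed(words))

# High-probability output: the reversed list reads as a natural phrase
assert reverse_words(["dog", "the", "fed", "she"]) == ["she", "fed", "the", "dog"]
# Low-probability output: the reversed list is an unlikely word sequence
assert reverse_words(["she", "fed", "the", "dog"]) == ["dog", "the", "fed", "she"]
```

A pure symbol manipulator would score identically on both cases; LLM accuracy differs between them.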
Our results show that we should be cautious about applying LLMs in low-probability situations
We should also be careful in how we interpret evaluations. A high score on a test set may not indicate mastery of the general task, esp. if the test set is mainly high-probability
8/n
We previously released a preprint of this work. What's new since then?
1. Condensed the paper to 12 pages (it was 84!)
2. More models: Claude, Llama, Gemini (plus GPT-3.5 & GPT-4) ➡️ Also o1! (see below)
3. Enhanced discussion - thank you to our very thoughtful reviewers!
9/n
I think this could make a fun paper for a reading group or seminar. There’s a lot that could be discussed, and it’s pretty accessible (especially now that it’s been shortened!)
10/n
Now for the question that many have asked us: Does o1 still show these effects, given that it is optimized for reasoning?
To our surprise...it does! o1 shows big improvements but gets the same qualitative effects.
11/n
Regarding task frequency: o1 does much better on rare versions of tasks than previous models do (left plot). But, when the tasks are hard enough, it still does better on common task variants than rare ones (right two plots)
12/n
Regarding output probability: o1 shows clear effects here. Interestingly, the effects don’t just show up in accuracy (top) but also in how many tokens o1 consumes to perform the task (bottom)!
13/n
You might also wonder how chain-of-thought (CoT) affects things. Models w/ CoT still show memorization effects but also show hallmarks of true reasoning! Thus, CoT brings qualitative improvements but doesn't fully address the embers of autoregression
14/n
One downside of the condensed paper length is that we had to remove lots of references. Apologies to the many people whose excellent papers had to be cut due to length constraints!
15/n
Overall link roundup:
In conclusion: To understand what language models are, we must understand what we have trained them to be.
For much more, see the paper:
Work by @RTomMcCoy, @ShunyuYao12, @DanFriedman0, @MDAHardy, and Tom Griffiths @cocosci_lab
Bayesian models can learn from few examples because they have strong inductive biases - factors that guide generalization. But the costs of inference and the difficulty of specifying generative models can make naturalistic data a challenge.
2/n
Neural networks have flexible representations that allow them to handle noisy natural data - as evidenced by the success of large language models. However, they notoriously require huge numbers of examples.
This very nice piece by Ted Chiang describes ChatGPT as a lossy compression of the Internet.
This idea is helpful for building intuition, but it's easy to miss an important point: Lossiness is not always a problem! In fact, if done right, it is exactly what we want.
To make this concrete, let’s consider a specific example. Suppose you encounter this list of sequences:
2/14
One way you could compress the list is by specifying only which pairs of adjacent letters occur (a bigram model). For example, the sequence "a a b b" is made by stitching together "<START> a", "a a", "a b", "b b", and "b <END>".
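That bigram compression can be sketched in a few lines (my own minimal version; function names are mine):

```python
from collections import defaultdict
import random

def bigram_table(sequences):
    """Store only which adjacent-token pairs occur, with boundary markers."""
    table = defaultdict(set)
    for seq in sequences:
        tokens = ["<START>"] + seq + ["<END>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev].add(nxt)
    return table

def generate(table, seed=0):
    """Stitch a new sequence together from the stored bigrams."""
    rng = random.Random(seed)
    out, cur = [], "<START>"
    while True:
        cur = rng.choice(sorted(table[cur]))
        if cur == "<END>":
            return out
        out.append(cur)

table = bigram_table([["a", "a", "b", "b"]])
```

The compression is lossy: the table can regenerate "a a b b", but it can also produce sequences that were never in the original list, such as "a b b".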
Neural-network language models (e.g., GPT-2) can generate high-quality text. Are they simply copying text they have seen before, or do they have generalizable linguistic abilities?
We generate text from language models and then analyze whether the text is novel or duplicated from the training set. We analyze novelty for sequential structure (n-grams) and syntactic structure.
2/n
In model-generated text, very few bigrams and trigrams are novel - i.e., most of them appear in the training set. But for 5-grams and larger, the majority are novel!
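A sketch of how such a novelty check can be computed (my own minimal version, not the paper's actual pipeline; the example corpora are invented):

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated, training, n):
    """Fraction of the generated text's n-grams absent from the training set."""
    gen = ngrams(generated, n)
    return sum(g not in ngrams(training, n) for g in gen) / len(gen)

training = "the cat sat on the mat".split()
generated = "the cat sat on the rug".split()
```

On this toy pair, only 1 of 5 bigrams of the generated text is novel (novelty 0.2), but 1 of its 2 five-grams is (novelty 0.5) - the same qualitative pattern described above, since longer n-grams are less likely to be duplicated verbatim.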
New paper: "Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence-to-sequence networks" w/ @Bob_Frank & @TalLinzen to appear in TACL
Interested in syntactic generalization? Read on! 1/
For 2 syntactic tasks, we train models on training sets that are ambiguous between two rules: one rule based on hierarchical structure and one based on linear order.
2/12
We then test the models on examples where the rules make different predictions to see whether they are biased toward linear or hierarchical rules. We do this for 100 re-runs of each model type to control for variability.