Tom McCoy
Oct 10 · 17 tweets · 6 min read
🤖🧠NOW OUT IN PNAS🧠🤖

Language models show many surprising behaviors. E.g., they can count 30 items more easily than 29

In Embers of Autoregression, we explain such effects by analyzing what LMs are trained to do


Major updates since the preprint!

1/n pnas.org/doi/10.1073/pn…

At the top is the title of the paper: "Embers of autoregression show how large language models are shaped by the problem they are trained to solve". Below on the left is a screenshot of ChatGPT being asked to count how many words are in a list. The correct answer is 29, but it says 30. Next to it is a plot showing ChatGPT's accuracy at counting elements in a list; in general, it does well on multiples of 10 but poorly on other numbers. The explanation offered at the bottom of the image is: In training sets, round numbers are much more common than other numbers.
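For anyone who wants to poke at the counting effect themselves, here is a minimal sketch of how such a probe could be set up. It is an illustration, not the paper's exact protocol: the word pool and prompt wording are made up, and `ask_model` is a hypothetical stand-in for whichever LLM client you use.

```python
import random

# A small pool of filler words for building counting prompts (illustrative only).
WORDS = ["apple", "river", "stone", "cloud", "music", "garden", "window", "paper"]

def make_counting_prompt(n, seed=0):
    """Build a prompt asking the model to count a list of n words."""
    rng = random.Random(seed)
    items = [rng.choice(WORDS) for _ in range(n)]
    return "How many words are in the following list?\n" + " ".join(items)

# The paper reports an accuracy spike at round numbers (e.g., 30) relative to
# neighbors such as 29, so probing lengths around a multiple of 10 is informative.
for n in (29, 30, 31):
    prompt = make_counting_prompt(n)
    # answer = ask_model(prompt)  # hypothetical call to your LLM client of choice
    print(n, prompt.splitlines()[1][:40] + "...")
```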
In this thread, find a summary of the work & some extensions (yes, the results hold for OpenAI o1!)

And note that we've condensed it to 12 pages - making it a much quicker read than the 84-page preprint!

2/n
Our big question: How can we develop a holistic understanding of large language models (LLMs)?

One popular approach has been to evaluate them w/ tests made for humans

But LLMs are not humans! The tests that are most informative about them might be different from the ones that are most informative about us

3/n Left: A table listing exams designed for humans that have been used to test GPT-4, such as the LSAT or the SAT Math test. Right: A cartoon showing a bunch of animals lined up (a bird, a monkey, a penguin, an elephant, a fish, a seal, and a dog). In front of the animals is a person saying “For a fair selection, everybody has to take the same exam: Please climb that tree.” The cartoon is by Barry Linton, based on an earlier version by Hans Traxler.
So how can we evaluate LLMs on their own terms?

We argue for a *teleological approach*, which has been productive in cognitive science: understand systems via the problem they adapted to solve

For LLMs this is autoregression (next-word prediction) over Internet text

4/n
By reasoning about next-word prediction, we derive several hypotheses about factors that will cause difficulty for LLMs

The 1st is task frequency: we predict better performance on frequent tasks than on rare ones, even when the tasks are equally complex

E.g., linear functions (see image)!

5/n Left: Example responses from GPT-4 when asked to take in a number and then multiply it by 9/5 and add 32. It gets this example right. When it is instead asked to multiply by 7/5 and add 31, now it gets the answer wrong. The right shows quantitative results with bar plots: GPT-3.5, GPT-4, Claude 3, and Llama 3 all do better when asked to perform the function (9/5)x+32 than the function (7/5)x + 31. A note explains: “(9/5)x+32 is common because it is the Celsius-to-Fahrenheit conversion. The other function has no special significance.”
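As a concrete reference, the two functions differ only in their coefficients; a tiny sketch like the one below (with made-up inputs, not the paper's) can generate ground-truth answers for checking model outputs.

```python
from fractions import Fraction

def common_variant(x):
    """(9/5)x + 32: frequent on the Internet because it converts Celsius to Fahrenheit."""
    return Fraction(9, 5) * x + 32

def rare_variant(x):
    """(7/5)x + 31: an equally simple linear function with no special significance."""
    return Fraction(7, 5) * x + 31

# Ground truth for a few illustrative inputs (exact arithmetic via Fraction):
for x in (10, 25, 40):
    print(x, common_variant(x), rare_variant(x))
# e.g., common_variant(25) == 77, rare_variant(25) == 66
```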
Another example: shift ciphers - decoding a message by shifting each letter N positions back in the alphabet.

On the Internet, the most common value for N is 13 (rot-13). Language models show a spike in accuracy at a shift of 13!

6/n Left: Example GPT-4 responses on shift ciphers with different shift levels, where the shift level is the number of positions backward in the alphabet that the message has to be shifted to decode it. In all cases, the correct answer is “But this time, there may also be another reason.” GPT-4 gets this correct answer when the shift is 3 or 13. But when the shift is 8, it instead responds “Say what you, think and then be silent.”, and when the shift is 9 it replies, “Try your best, young man and believe in yourself.”
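For reference, the decoding task itself is only a few lines of code; the hard part for an LLM is doing it in-context. Here is a minimal implementation (the example ciphertext is simply the rot-13 encoding of the answer quoted above, not necessarily the paper's exact stimulus):

```python
def shift_decode(ciphertext, n):
    """Decode a shift cipher: move each letter n positions back in the alphabet."""
    out = []
    for ch in ciphertext:
        if ch.islower():
            out.append(chr((ord(ch) - ord("a") - n) % 26 + ord("a")))
        elif ch.isupper():
            out.append(chr((ord(ch) - ord("A") - n) % 26 + ord("A")))
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)

# rot-13 (shift = 13) is its own inverse, which helps explain why shift-13 text
# is by far the most common shifted text online.
msg = "Ohg guvf gvzr, gurer znl nyfb or nabgure ernfba."
print(shift_decode(msg, 13))
# -> But this time, there may also be another reason.
```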
The 2nd factor we predict will influence LLM accuracy is output probability

Indeed, across many tasks, LLMs score better when the output is high-probability than when it is low-probability - even though the tasks are deterministic

E.g.: Reversing a list of words (see img)

7/n Top: GPT-4 responses when asked to reverse a list of words. When the answer is a medium-probability list of words, it gets the answer right, but when it is a low-probability list, it gets the answer wrong.  Bottom: A plot showing systematic results. All five language models depicted (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) show better performance at reversing a list when the output has a high probability than when it has a low probability.
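To make the output-probability manipulation concrete, here is a small sketch for the list-reversal task. The word lists are made up, and `log_prob_under_lm` is a hypothetical stand-in for however you would score a string under a language model.

```python
def reverse_words(words):
    """Ground truth for the task: return the words in reverse order."""
    return list(reversed(words))

# Two candidate target outputs: one reads as fluent text (high probability under
# a language model), the other is a jumbled word salad (low probability).
high_prob_target = "the cat sat on the mat".split()
low_prob_target = "mat on the the cat sat".split()

for target in (high_prob_target, low_prob_target):
    model_input = list(reversed(target))         # the list the model would be shown
    assert reverse_words(model_input) == target  # correct answer is the chosen target
    # score = log_prob_under_lm(" ".join(target))  # hypothetical LM scoring call
    print(" ".join(model_input), "->", " ".join(target))
```

The task is identical in both cases; only the probability of the correct output string changes, which is exactly the manipulation behind the accuracy gap described above.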
Our results show that we should be cautious about applying LLMs in low-probability situations

We should also be careful in how we interpret evaluations. A high score on a test set may not indicate mastery of the general task, esp. if the test set is mainly high-probability

8/n
We previously released a preprint of this work. What's new since then?
1. Condensed the paper to 12 pages (it was 84!)
2. More models: Claude, Llama, Gemini (plus GPT-3.5 & GPT-4)
➡️ Also o1! (see below)
3. Enhanced discussion - thank you to our very thoughtful reviewers!

9/n
I think this could make a fun paper for a reading group or seminar. There’s a lot that could be discussed, and it’s pretty accessible (especially now that it’s been shortened!)

10/n
Now for the question that many have asked us: Does o1 still show these effects, given that it is optimized for reasoning?

To our surprise...it does! o1 shows big improvements but gets the same qualitative effects.

Addendum on arXiv w/ o1 results:

11/n arxiv.org/abs/2410.01792
Regarding task frequency: o1 does much better on rare versions of tasks than previous models do (left plot). But when the tasks are hard enough, it still does better on common task variants than on rare ones (right two plots)

12/n Plots showing OpenAI o1's performance on common and rare versions of tasks. In general, on our basic evaluations, o1 is close to 100% accuracy on both common and rare task variants. However, on harder evaluations, it shows some separation, with stronger performance on common task variants than rare ones.
Regarding output probability: o1 shows clear effects here. Interestingly, the effects don’t just show up in accuracy (top) but also in how many tokens o1 consumes to perform the task (bottom)!

13/n Top plot: o1 scores better on several tasks when the output log probability is high than when it is low.  Bottom plot: o1 uses more tokens when the output is low probability than when it is high probability.
You might also wonder how chain-of-thought (CoT) affects things. Models w/ CoT still show memorization effects but also show hallmarks of true reasoning! Thus, CoT brings qualitative improvements but doesn't fully address the embers of autoregression

14/n
One downside of condensing the paper is that we had to remove lots of references. Apologies to the many people whose excellent papers had to be cut due to length constraints!

15/n
Overall link roundup:

1. Embers of Autoregression: pnas.org/doi/10.1073/pn…

2. Follow-up about OpenAI o1: arxiv.org/abs/2410.01792

3. Analysis of chain-of-thought: arxiv.org/abs/2407.01687

4. Blog post where you can explore model outputs: rtmccoy.com/embers_shift_c…

16/n
In conclusion: To understand what language models are, we must understand what we have trained them to be.

For much more, see the paper:

Work by @RTomMcCoy, @ShunyuYao12, @DanFriedman0, @MDAHardy, and Tom Griffiths @cocosci_lab

17/17 pnas.org/doi/10.1073/pn…


