Tom McCoy Profile picture
Assistant professor @YaleLinguistics. Studying computational linguistics, cognitive science, and AI. He/him.
Oct 10 17 tweets 6 min read
🤖🧠NOW OUT IN PNAS🧠🤖

Language models show many surprising behaviors. E.g., they can count 30 items more easily than 29

In Embers of Autoregression, we explain such effects by analyzing what LMs are trained to do


Major updates since the preprint!

1/n pnas.org/doi/10.1073/pn…

[Image] At the top is the title of the paper: "Embers of autoregression show how large language models are shaped by the problem they are trained to solve". Below on the left is a screenshot of ChatGPT being asked to count how many words are in a list; the correct answer is 29, but it says 30. Next to it is a plot showing ChatGPT's accuracy at counting elements in a list: in general, it does well on multiples of 10 but poorly on other numbers. The explanation offered at the bottom of the image is: in training sets, round numbers are much more common than other numbers.

@ShunyuYao12 @danfriedman0 @mdahardy @cocosci_lab In this thread, find a summary of the work & some extensions (yes, the results hold for OpenAI o1!)

And note that we've condensed it to 12 pages - making it a much quicker read than the 84-page preprint!

2/n
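An aside for readers who want to poke at the frequency explanation themselves: a minimal sketch (not the paper's analysis code) that tallies how often each numeral appears in whatever plain-text corpus you have on hand. `corpus.txt` is a placeholder path.

```python
# Minimal sketch, not the paper's code: tally how often the numerals 25-35
# appear in a corpus, to check whether round numbers (multiples of 10) really
# are much more frequent than their neighbors.
# "corpus.txt" is a hypothetical path - point it at any large plain-text file.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

counts = Counter(int(m) for m in re.findall(r"\b\d{1,3}\b", text))

for n in range(25, 36):
    marker = "  <- round number" if n % 10 == 0 else ""
    print(f"{n:3d}: {counts[n]:8d}{marker}")
```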
Sep 26, 2023 14 tweets 5 min read
🤖🧠NEW PAPER🧠🤖

Language models are so broadly useful that it's easy to forget what they are: next-word prediction systems

Remembering this fact reveals surprising behavioral patterns: 🔥Embers of Autoregression🔥 (counterpart to "Sparks of AGI")


1/8 arxiv.org/abs/2309.13638
[Image] The top says: “Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve. By R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths.” The bottom left shows a ClipArt image of fire; the top of the fire is labeled “Sparks of AGI,” and the bottom is labeled “Embers of autoregression”. The bottom right shows a box labeled “Shift ciphers” with two examples of GPT-4 responses. First, when asked to shift each letter in a message back by 13, GPT-4 gets the correct answer: “I think everyone has their own path, and they ca...

@ShunyuYao12 @danfriedman0 @mdahardy @cocosci_lab Our big question: How can we develop a holistic understanding of large language models (LLMs)?

One popular approach has been to evaluate them w/ tests made for humans

But LLMs are not humans! The tests that are most informative about them might be different from the ones that are most informative about us

2/8

[Image] Left: A table listing exams designed for humans that have been used to test GPT-4, such as the LSAT or the SAT Math test. Right: A cartoon showing a bunch of animals lined up (a bird, a monkey, a penguin, an elephant, a fish, a seal, and a dog). In front of the animals is a person saying “For a fair selection, everybody has to take the same exam: Please climb that tree.” The cartoon is by Barry Linton, based on an earlier version by Hans Traxler.
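For context on the shift-cipher example in the image: the cipher itself is trivial to compute; what differs between shift-by-13 and other shifts is only how often they appear in Internet text. A minimal sketch (my own, not the paper's evaluation harness) for generating test items:

```python
# Minimal sketch, not the paper's evaluation code: a shift (Caesar) cipher for
# generating test items. rot-13 is common online; other shifts are rare.
def shift_cipher(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch.islower():
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch.isupper():
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # leave spaces and punctuation untouched
    return "".join(out)

message = "I think everyone has their own path"
encoded = shift_cipher(message, 13)
assert shift_cipher(encoded, -13) == message   # decoding = shifting back
print(encoded)
print(shift_cipher(message, 12))               # a rarer shift, same difficulty in principle
```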
May 30, 2023 15 tweets 6 min read
🤖🧠NEW PAPER🧠🤖

Bayesian models can learn rapidly. Neural networks can handle messy, naturalistic data. How can we combine these strengths?

Our answer: Use meta-learning to distill Bayesian priors into a neural network!

Paper: arxiv.org/abs/2305.14701

1/n

[Image] A schematic of our method. ...

Bayesian models can learn from few examples because they have strong inductive biases - factors that guide generalization. But the costs of inference and the difficulty of specifying generative models can make naturalistic data a challenge.

2/n

[Image] Screenshot of a demo of Bay...
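A toy illustration of the general idea (my own sketch, not the paper's method or code, which targets much richer language-learning settings): sample tasks from a prior, train a network across those tasks, and its predictions come to approximate Bayesian inference under that prior.

```python
# Toy sketch, not the paper's code: distill a uniform (Beta(1,1)) prior over coin
# biases into a small network by training across many sampled "coins". After
# training, the network's next-flip prediction approximates the Bayesian
# posterior predictive (heads + 1) / (n + 2).
import torch
import torch.nn as nn

SEQ_LEN = 10
net = nn.Sequential(nn.Linear(SEQ_LEN, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    bias = torch.rand(64, 1)                              # one coin bias per task
    flips = (torch.rand(64, SEQ_LEN + 1) < bias).float()  # flips from each coin
    context, target = flips[:, :SEQ_LEN], flips[:, SEQ_LEN:]
    loss = nn.functional.binary_cross_entropy_with_logits(net(context), target)
    opt.zero_grad(); loss.backward(); opt.step()

seq = torch.tensor([[1., 1, 1, 0, 1, 1, 0, 1, 1, 1]])     # 8 heads out of 10
print("network:", torch.sigmoid(net(seq)).item())
print("Bayes:  ", ((seq.sum() + 1) / (SEQ_LEN + 2)).item())
```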
Feb 14, 2023 15 tweets 4 min read
This very nice piece by Ted Chiang describes ChatGPT as a lossy compression of the Internet.

This idea is helpful for building intuition, but it's easy to miss an important point: Lossiness is not always a problem! In fact, if done right, it is exactly what we want.

1/14

To make this concrete, let’s consider a specific example. Suppose you encounter this list of sequences:

2/14

[Image] A list of sequences: 1. a ...
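Since the example list in the image is cut off here, a stand-in illustration (my own, not the thread's example) of why lossiness can be exactly what we want:

```python
# Toy illustration (not the thread's example): lossless storage memorizes the
# training pairs but cannot answer anything new; a lossy summary (a single fitted
# slope) discards the individual entries yet generalizes.
train = {2: 4, 5: 10, 7: 14, 11: 22}          # hypothetical data following x -> 2x

lossless = dict(train)                         # verbatim lookup table
print(lossless.get(6))                         # None: unseen input, no answer

xs, ys = zip(*train.items())
slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(slope * 6))                        # 12: the lossy summary generalizes
```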
May 4, 2022 5 tweets 3 min read
🤖🧠NEW PAPER🧠🤖

What explains the dramatic recent progress in AI?

The standard answer is scale (more data & compute). But this misses a crucial factor: a new type of computation.

Shorter opinion piece: arxiv.org/abs/2205.01128
Longer tutorial: microsoft.com/en-us/research…

1/5

[Image] At the top is a paper title...

To understand current AI, we need some insights from CogSci and from 20th-century AI.

In CogSci, two crucial factors for human-level intelligence are compositionality and continuity.

2/5
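One concrete way to see how compositional structure can live in continuous vectors, in the spirit of the tensor-product-style binding discussed in this line of work (my own toy code, not the paper's):

```python
# Toy sketch (my own, not the paper's code): tensor-product-style binding shows
# how a representation can be compositional (built from roles and fillers) and
# continuous (just vectors and sums) at the same time.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
roles = {"subject": rng.standard_normal(dim), "object": rng.standard_normal(dim)}
fillers = {"dog": rng.standard_normal(dim), "cat": rng.standard_normal(dim)}

# "dog chases cat": bind filler vectors to role vectors via outer products, then sum.
rep = np.outer(fillers["dog"], roles["subject"]) + np.outer(fillers["cat"], roles["object"])

# Approximately unbind the subject by querying with the subject role vector.
subj = rep @ roles["subject"] / (roles["subject"] @ roles["subject"])
print("recovered subject is closer to 'dog':",
      subj @ fillers["dog"] > subj @ fillers["cat"])
```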
Nov 19, 2021 13 tweets 5 min read
*NEW PREPRINT*

Neural-network language models (e.g., GPT-2) can generate high-quality text. Are they simply copying text they have seen before, or do they have generalizable linguistic abilities?

Answer: Some of both!

Paper: arxiv.org/abs/2111.09509

1/n

[Image] Paper title: “How much do language models copy from their…

Work done with @tallinzen, Paul Smolensky, @JianfengGao0217, & @real_asli.

We generate text from language models and then analyze whether the text is novel or duplicated from the training set. We analyze novelty for sequential structure (n-grams) and syntactic structure.

2/n
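For intuition about the analysis, a minimal sketch (not the paper's pipeline) of checking which n-grams in generated text also occur in the training text:

```python
# Minimal sketch, not the paper's analysis code: measure how many n-grams in a
# generated text are novel, i.e. absent from the training text. The texts below
# are hypothetical stand-ins.
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

training_text = "the cat sat on the mat and the dog sat on the rug"
generated_text = "the dog sat on the mat"

train_toks, gen_toks = training_text.split(), generated_text.split()
for n in range(2, 6):
    gen = [tuple(gen_toks[i:i + n]) for i in range(len(gen_toks) - n + 1)]
    novel = [g for g in gen if g not in ngram_set(train_toks, n)]
    print(f"{n}-grams: {len(novel)}/{len(gen)} novel")
```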
Jan 14, 2020 12 tweets 9 min read
New paper: "Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence-to-sequence networks" w/ @Bob_Frank & @TalLinzen to appear in TACL

Paper arxiv.org/pdf/2001.03632…
Website rtmccoy.com/rnn_hierarchic…

Interested in syntactic generalization? Read on! 1/

@bob_frank @tallinzen For 2 syntactic tasks, we train models on training sets that are ambiguous between two rules: one rule based on hierarchical structure and one based on linear order.

2/12
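To make the two candidate rules concrete, here is a toy illustration with English question formation (my own sketch, not the paper's data-generation code; the main-clause auxiliary is hard-coded for these examples):

```python
# Toy sketch, not the paper's code: two rules that agree on simple training
# sentences but diverge on sentences containing a relative clause.
def linear_rule(sentence):
    """Front the FIRST auxiliary ('can') in the word string."""
    words = sentence.split()
    i = words.index("can")
    return " ".join(["can"] + words[:i] + words[i + 1:]) + "?"

def hierarchical_rule(sentence):
    """Front the MAIN-CLAUSE auxiliary (hard-coded here as the last 'can')."""
    words = sentence.split()
    i = len(words) - 1 - words[::-1].index("can")
    return " ".join(["can"] + words[:i] + words[i + 1:]) + "?"

ambiguous = "the dog can bark"                   # both rules give "can the dog bark?"
diagnostic = "the dog that can swim can bark"    # now the rules diverge

print(linear_rule(ambiguous) == hierarchical_rule(ambiguous))  # True: training data can't tell them apart
print(linear_rule(diagnostic))        # can the dog that swim can bark?  (linear order)
print(hierarchical_rule(diagnostic))  # can the dog that can swim bark?  (hierarchy)
```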