Michael C. Frank
Apr 4 · 16 tweets · 4 min read
People are testing large language models (LLMs) on their "cognitive" abilities - theory of mind, causality, syllogistic reasoning, etc. Many (most?) of these evaluations are deeply flawed. To evaluate LLMs effectively, we need some principles from experimental psychology.🧵
Just to be clear, in this thread I'm not saying that LLMs do or don't have *any* cognitive capacity. I'm trying to discuss a few basic ground rules for *claims* about whether they do.
Why use ideas from experimental psychology? Well, ChatGPT and other chat LLMs are non-reproducible. Without versioning and random seeds, we have to treat them as "non-human subjects."
To make the analogy clear: a single generation from an LLM is being compared to one measurement of a single person. The prompt defines the task, and the specific question being asked is the measurement item (sometimes these are one and the same).
Flip it around: it'd be super weird to approach a person on the street, ask them a question from the computer science literature (say, about graph coloring), and then, based on the result, tweet that "people do/don't have the ability to compute NP-hard problems"!

So, principles:
1. An evaluation study must have multiple observations for each evaluation item! It must also have multiple items for each construct. With only one or a few items, you can't tell whether the observed responses reflect the construct or just idiosyncrasies of those items. See: cambridge.org/core/journals/…
Far too many of the claims being made rest on a small number of generations (one screenshot!) or a small number of prompts/questions, or both. To draw generalizable conclusions, you need dozens of each at a bare minimum.
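To make principle 1 concrete, here's a minimal sketch of what such an evaluation harness might look like. The `generate` call, the item format, and the `score_response` rule are hypothetical placeholders standing in for whatever model API and scoring scheme you actually use, not a specific implementation.

```python
# Sketch of principle 1: many items per construct, many generations per item.
# `generate(prompt)` and `score_response(response, answer)` are hypothetical
# stand-ins for your model call and your scoring rule.

from statistics import mean

def evaluate(items, generate, score_response, n_generations=20):
    """Return per-item mean score over repeated generations, plus the overall mean."""
    per_item = {}
    for item in items:                    # dozens of items per construct, not one
        scores = []
        for _ in range(n_generations):    # dozens of generations per item, not one screenshot
            response = generate(item["prompt"])
            scores.append(score_response(response, item["answer"]))
        per_item[item["id"]] = mean(scores)
    return per_item, mean(per_item.values())
```

Reporting the per-item means (not just the overall average) also lets you see whether a few idiosyncratic items are driving the result.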
2. Psychologists know that the way you frame a task shapes the answer you get. Children especially often fail or succeed based on task framing! And sometimes you get responses to a task you didn't intend (i.e., via "task demands").
srcd.onlinelibrary.wiley.com/doi/full/10.11…
We all know LLMs are deeply prompt-sensitive too ("Let's think step by step..."). So why use a single prompt to make inferences about their cognitive abilities?
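One way to take this seriously, sketched here as a hypothetical extension of the harness above: cross several framings of the task with every item, so that no conclusion rests on a single wording. The framing templates below are illustrative, not a recommended set.

```python
# Sketch of principle 2: cross multiple task framings with every item.
# The framing templates are illustrative placeholders.

def expand_with_framings(items, framings):
    """Create one evaluation item per (framing, base item) pair."""
    expanded = []
    for f_idx, framing in enumerate(framings):
        for item in items:
            expanded.append({
                "id": f"{item['id']}-framing{f_idx}",
                "prompt": framing.format(question=item["question"]),
                "answer": item["answer"],
            })
    return expanded

framings = [
    "{question}",
    "Answer the following question. {question}",
    "Let's think step by step. {question}",
]
```

The expanded items can then be fed straight into the `evaluate` sketch above, letting you check whether the claimed ability survives across framings.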
3. When you claim an LLM has X (ToM, causality, etc.), you are making a claim about a construct. But you are testing this construct through an operationalization. Good experimental psychology makes this link explicit, arguing for the validity of the link between measure and construct.
In contrast, many arguments about LLM capacities never explicitly justify that this particular measure (prompt, task, etc.) actually is the [right / best / most reasonable] operationalization of the construct. If the measure isn't valid, it can't tell you about the construct!
4. Experimental evaluations require a control condition! The purpose of the control condition is to hold all other aspects of the task constant but to vary only the specific construct under investigation. Finding a good control is hard, but it's a key part of a good study.
For example, when we query theory of mind using false belief understanding, researchers often compare to control conditions like false photograph understanding that are argued to be similar in all respects except the presence of mental state information: saxelab.mit.edu/use-our-effici…
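In harness terms (again a hypothetical sketch, with made-up condition labels rather than real stimuli), that means running matched test and control items through the same pipeline and reporting the comparison, not test-condition accuracy in isolation. Here the per-item scores are the ones returned by the `evaluate` sketch above.

```python
# Sketch of principle 4: compare matched test vs. control items, e.g. false-belief
# vs. false-photograph vignettes that differ only in mental-state content.
# Condition labels and item structure are placeholders, not real stimuli.

from statistics import mean

def compare_conditions(per_item_scores, items,
                       test="false_belief", control="false_photo"):
    """Compare mean score on test items against matched control items."""
    test_scores = [per_item_scores[i["id"]] for i in items if i["condition"] == test]
    control_scores = [per_item_scores[i["id"]] for i in items if i["condition"] == control]
    return {
        test: mean(test_scores),
        control: mean(control_scores),
        "difference": mean(test_scores) - mean(control_scores),
    }
```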
In sum: minimal criteria for making claims about an LLM should be 1) multiple generations & test items, 2) a range of prompts (tasks), 3) evidence for task validity as a measure of the target construct, and 4) a control condition equating other aspects of the task.
All of these ideas (in the human context) are discussed more in our free textbook, Experimentology: experimentology.io
Addenda from good points in the replies: 5) evaluations should be novel, meaning not present in the training data, and 6) varying the parameters of the LLM matters, at least the ones you have access to (e.g., temperature).
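For addendum 6, one hypothetical way to build the parameter check into the same harness, assuming a `generate` call that accepts a `temperature` argument (that signature is an assumption, not a specific API):

```python
# Sketch of addendum 6: repeat the full evaluation at several temperature settings,
# since a result at a single setting may not generalize.
# `generate(prompt, temperature=...)` is an assumed signature, not a specific API.

def evaluate_across_temperatures(items, generate, score_response,
                                 temperatures=(0.0, 0.7, 1.0), n_generations=20):
    """Run the `evaluate` sketch above once per temperature setting."""
    results = {}
    for temp in temperatures:
        def generate_at_temp(prompt, _temp=temp):
            return generate(prompt, temperature=_temp)
        per_item, overall = evaluate(items, generate_at_temp, score_response, n_generations)
        results[temp] = {"overall": overall, "per_item": per_item}
    return results
```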