🚨New WP: Can LLMs predict results of social science experiments?🚨
Prior work uses LLMs to simulate survey responses, but can they predict results of social science experiments?
Across 70 studies, we find striking alignment (r = .85) between simulated and observed effects 🧵👇
To evaluate predictive accuracy of LLMs for social science experiments, we used #GPT4 to predict 476 effects from 70 well-powered experiments, including:
➡️50 survey experiments conducted through the NSF-funded TESS program
➡️20 additional replication studies (Coppock et al. 2018)
We prompted the model with (a) demographic profiles drawn from a representative dataset of Americans, and (b) experimental stimuli. The effects estimated by pooling these responses were strongly correlated with the actual experimental effects (r = .85; adj. r = .91)!
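For intuition, here is a minimal Python sketch of that pipeline (not the paper's code: `profiles`, the 1-7 response scale, and the prompt wording are all stand-in assumptions):

```python
# Sketch: simulate one experiment's effect with LLM "participants".
# Assumes `profiles` holds demographic-profile strings and each condition
# has a stimulus text; all names here are hypothetical.
from openai import OpenAI
import numpy as np
from scipy.stats import pearsonr

client = OpenAI()

def simulate_response(profile: str, stimulus: str) -> float:
    """Ask the model to answer as the described respondent (1-7 scale)."""
    prompt = (
        f"You are a survey respondent with this profile:\n{profile}\n\n"
        f"{stimulus}\n\nAnswer with a single number from 1 to 7."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip())  # no parsing guardrails in this sketch

def simulated_effect(profiles, control_text, treatment_text) -> float:
    """Pool simulated responses per condition and difference the means."""
    control = [simulate_response(p, control_text) for p in profiles]
    treated = [simulate_response(p, treatment_text) for p in profiles]
    return float(np.mean(treated) - np.mean(control))

# With one simulated and one observed value per condition contrast:
# r, _ = pearsonr(simulated_effects, observed_effects)
```

Pooling over many profiles is what turns individual simulated answers into a sample-level effect estimate.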
We also find that predictive accuracy improved across generations of LLMs, with GPT4 surpassing predictions elicited from an online sample (N = 2,659) of Americans.
But what if LLMs are simply retrieving & reproducing known experimental results from training data?
We find evidence against this: analyzing only studies *unpublished* at the time of GPT4’s training data cut-off, we find high predictive accuracy (r = .90, adj. r = .94).
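As a sketch, this cut-off check is just a date filter before recomputing the correlation (hypothetical file and column names; GPT-4's widely reported September 2021 training cut-off is assumed):

```python
# Sketch: restrict to studies GPT-4 could not have seen during training.
import pandas as pd
from scipy.stats import pearsonr

CUTOFF = pd.Timestamp("2021-09-01")  # assumed GPT-4 training cut-off

df = pd.read_csv("effects.csv", parse_dates=["published_date"])  # hypothetical file/columns
holdout = df[df["published_date"] > CUTOFF]  # unpublished at training time

r, p = pearsonr(holdout["simulated_effect"], holdout["observed_effect"])
print(f"holdout r = {r:.2f} (n = {len(holdout)}, p = {p:.3g})")
```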
Important work finds biases in LLM responses resulting from training data inequalities. Do these biases impact accurate prediction of experimental results?
To assess, we compare predictive accuracy for:
➡️women & men
➡️Black & white participants
➡️Democrats & Republicans
Despite known training data inequalities, LLM-derived predictive accuracy was comparable across subgroups.
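A sketch of how such a subgroup check can run (hypothetical file and column names, one row per effect per subgroup):

```python
# Sketch: compare simulated-vs-observed correlations within each subgroup.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("subgroup_effects.csv")  # hypothetical file

# e.g. subgroups: women/men, Black/white, Democrats/Republicans
for group, sub in df.groupby("subgroup"):
    r, _ = pearsonr(sub["simulated_effect"], sub["observed_effect"])
    print(f"{group}: r = {r:.2f} (n = {len(sub)})")
```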
However, there was little heterogeneity in the experimental effects we studied, so more research is needed to assess if/how LLM predictions of experimental results are biased.
We also evaluated predictive accuracy for “megastudies,” studies comparing the impact of a large number of interventions. Across nine survey and field megastudies, LLM-derived predictions were modestly accurate.
(Notably, accuracy matched or surpassed expert forecasters)
Finally, we find LLMs can accurately predict effects on socially harmful outcomes, such as the impact of antivax FB posts on vax intentions (@_JenAllen et al., 2024). This capacity may have positive uses, such as content moderation, though it also highlights risks of misuse.
Overall, our results show that LLM-derived predictions of experiments with human participants are highly accurate, generally more accurate than predictions from samples of lay and expert humans.
This capacity has several applications for science and practice – e.g., running low-cost pilots to identify promising interventions, or simulating experiments that may be harmful to participants – but also limitations and risks, including concerns about bias, overuse, & misuse.
To explore further, use this demo to generate predicted experimental effects with LLM-simulated participants!
👇 treatmenteffect.app
*Major* kudos to @lukebeehewitt and @AshuAshok (who co-led the research), and to @ghezae_isaias.
And thanks to @pascl_stanford and @StanfordPACS for generously supporting this project.
H/t also to some of the many scholars whose work we drew on:
@JEichstaedt @danicajdillion @kurtjgray @chris_bail @lpargyle @johnjhorton @mcxfrank @joon_s_pk @msbernst @percyliang @kerstingAIML @davidwingate @lltjuatja @gneubig @nikbpetrov @SchoeneggerPhil @molly_crockett
Many Republican voters believe the 2020 election was stolen. For them, voting for election-denying Republican candidates helps their favored party AND helps to defend democratic principles.
This is a misinformation problem.
See, e.g.: papers.ssrn.com/sol3/papers.cf…
But many Republican voters do *not* believe the 2020 election was stolen. For these folks, deciding whether to vote for election-denying Republican candidates involves a tension between partisan interests and democratic principles.
⚡️For a quick summary of our results, check out this excellent video produced by the brilliant folks @StanfordHAI (& the thread below!)⚡️
We use the validated @perspective API to estimate levels of “toxicity” in 1.3 million tweets by Congresspeople from '09-'19 (findings robust with alt measures of toxicity)
Overall, toxicity ⬆️ 23% over the time period
Over the same period, toxicity of Congress speeches actually ⬇️
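For reference, a minimal sketch of a Perspective API toxicity query (endpoint and response fields per Google's public docs; YOUR_KEY is a placeholder, and the paper's batching and robustness checks are omitted):

```python
# Sketch: score one tweet's toxicity with the Perspective API.
import requests

URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=YOUR_KEY")  # YOUR_KEY is a placeholder

def toxicity(text: str) -> float:
    """Return Perspective's TOXICITY probability (0-1) for one text."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```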
In line with claims that American democracy is in crisis, we found concerning baseline levels of potentially problematic attitudes, e.g.:
➤Partisan animosity
➤Support for undemocratic practices
➤Support for undemocratic candidates
➤Biased evaluation of politicized facts
We test 25 interventions to reduce such attitudes, submitted by social scientists & practitioners. Most targeted partisan animosity, but many also aimed to reduce support for undemocratic practices or partisan violence. Walk-through of interventions here👇
🚨Results are in for the Strengthening Democracy Challenge. Winners will be announced this week!🚨
In this thread, we announce the 25 submissions we selected to test. We think these submissions are awesome & hope you do too.
But first, how we got here…👇🧵
BACKSTORY: last summer we invited people to submit ideas for how to reduce Americans’ anti-democratic attitudes, support for partisan violence, and/or partisan animosity.
Our research team worked w/ a stellar advisory board to select the 25 interventions we found most promising & then tested them in a massive (N>31,000) online survey experiment
What was eligible?
Short interventions (< 8 minutes) that were deployable online.
🚨Call for Submissions🚨 “The Strengthening Democracy Challenge,” a large-scale project testing interventions to reduce (a) anti-democratic attitudes, (b) support for partisan violence, and/or (c) partisan animosity, is open for submissions NOW 1/
American democracy faces major problems. Americans are willing to compromise on democratic principles for partisan goals. Some people are willing to resort to violence to help their side win. Extreme dislike for rival partisans has grown significantly in recent decades. 2/
To deepen understanding of how to address these problems, we will conduct a large (up to 30k participants) experiment testing up to 25 submitted interventions designed to reduce anti-democratic attitudes, support for partisan violence, and/or partisan animosity among Americans 3/
We identify three broad approaches (general public health messages, promotion by trusted politicians, and promotion by trusted nonpolitical influencers) and review the behavioral science relevant to each.
We highlight @deaneckles and colleagues’ research on how vaccine intentions can be increased (in the US and beyond) by informing people about the actually very high levels of vaccination and vaccine intentions in the general public.