Matthew Salganik

@msalganik

, 35 tweets, 22 min read

If hundreds of scientists created predictive algorithms with high-quality data, how well would the best predict life outcomes? Not very well. Fragile Families Challenge: paper in PNAS w 112 authors doi.org/10.1073/pnas.1… & Special Collection of Socius journals.sagepub.com/topic/collecti…

We started with high-quality data. The Fragile Families and Child Wellbeing Study (@FFCWS) measured numerous domains of life for a cohort of families over many years. It has been used in more than 750 scientific papers. ffpubs.princeton.edu

We used these data in a new way: the common task method. We picked 6 outcome variables (e.g., GPA). Approved researchers who agreed to our terms received predictors for all families (background) & outcomes for half (training). Goal: predict the outcomes they did not receive (holdout).
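A minimal sketch of the common task setup described above, using toy data. The function name `common_task_split`, the toy `background` and `outcomes` dictionaries, and the plain random 50/50 split are illustrative assumptions; the Challenge's actual data release and splitting procedure may have differed.

```python
import random


def common_task_split(family_ids, train_fraction=0.5, seed=0):
    """Randomly assign family ids to a training set and a holdout set.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    shuffled = list(family_ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return set(shuffled[:n_train]), set(shuffled[n_train:])


# Toy stand-ins: predictors exist for every family, as do true outcomes.
background = {fid: {"x": fid % 7} for fid in range(10)}
outcomes = {fid: 2.0 + 0.1 * (fid % 7) for fid in range(10)}

train_ids, holdout_ids = common_task_split(background.keys())

# Participants would see background for all families, but outcomes
# only for the training half. Holdout outcomes stay with the organizers.
released_outcomes = {fid: outcomes[fid] for fid in train_ids}
hidden_outcomes = {fid: outcomes[fid] for fid in holdout_ids}
```

Keeping the holdout outcomes with the organizers is what lets every team be scored on the same unseen cases.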

160 teams tried. No one was very successful. For every outcome, the best algorithm was much closer to simple guessing than it was to perfect prediction. And it was only slightly better than a 4-variable regression model (shown dashed in the accompanying figure).

What does an R^2_holdout of 0.2 look like? Here is the most accurate submission predicting GPA.
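For readers who want to see what a number like 0.2 means, here is a sketch of one way to compute a holdout R^2. It assumes the metric compares squared error on the holdout set against the squared error of always predicting the training-set mean, so 0 corresponds to "simple guessing" and 1 to perfect prediction. The thread does not spell out the formula, so treat this definition, the function name `r2_holdout`, and the toy numbers as assumptions.

```python
def r2_holdout(y_holdout, y_pred, y_train_mean):
    """Holdout R^2 relative to predicting the training mean (assumed definition).

    1.0 means perfect prediction, 0.0 means no better than guessing the
    training mean, and negative values mean worse than that baseline.
    """
    if len(y_holdout) != len(y_pred):
        raise ValueError("y_holdout and y_pred must have the same length")
    sse_model = sum((y - p) ** 2 for y, p in zip(y_holdout, y_pred))
    sse_baseline = sum((y - y_train_mean) ** 2 for y in y_holdout)
    if sse_baseline == 0:
        raise ValueError("baseline error is zero; R^2 is undefined")
    return 1.0 - sse_model / sse_baseline


# Toy example with made-up numbers.
y_train = [2.0, 3.0, 4.0]
train_mean = sum(y_train) / len(y_train)
y_holdout = [2.0, 3.0, 4.0]

perfect = r2_holdout(y_holdout, y_holdout, train_mean)
guessing = r2_holdout(y_holdout, [train_mean] * 3, train_mean)
partial = r2_holdout(y_holdout, [2.5, 3.0, 3.5], train_mean)
```

Under this definition, a score of 0.2 means the submission removed about a fifth of the squared error left by the mean-only baseline.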

We thought perhaps some algorithms would predict some observations well, and other algorithms would predict other observations well. Nope. They missed pretty much the same way for all families.
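One simple way to check whether two algorithms "miss the same way" is to correlate their prediction errors across families. The sketch below is an illustration of that idea, not the analysis used in the paper; the function names and toy numbers are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def error_correlation(y_true, pred_a, pred_b):
    """Correlation between the errors of two sets of predictions."""
    err_a = [p - y for p, y in zip(pred_a, y_true)]
    err_b = [p - y for p, y in zip(pred_b, y_true)]
    return pearson(err_a, err_b)


# Toy example: both "algorithms" predict close to the middle of the
# outcome range, so they over- and under-predict the same families.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_a = [2.0, 2.0, 2.0, 2.0]
pred_b = [2.5, 2.5, 2.5, 2.5]

corr = error_correlation(y_true, pred_a, pred_b)
```

A correlation near 1 indicates that the two sets of predictions err on the same families in the same direction, so combining them would gain little.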

For policymakers deploying predictive algorithms in high-stakes decisions, our result is a reminder of a basic fact: one should not assume that algorithms predict well. That must be demonstrated with transparent, empirical evidence.

For scientists, our result raises an understanding/prediction paradox: understanding has been generated by these data (as demonstrated by more than 750 published journal articles), yet the very same data could not yield accurate predictions.

The paradox is resolvable in at least three ways: (1) our understanding is poor, (2) prediction is a poor measure of understanding, or (3) our understanding is incomplete without a theory that points toward poor prediction. Future research is needed.

Poor predictions by 1 team could be ignored. The collective failure of 160 teams is harder to ignore. This mass collaboration illustrates a broader idea: some social research questions may be better solved collectively than individually. We can do more together than alone.

Paper at doi.org/10.1073/pnas.1…. Replication materials at doi.org/10.7910/DVN/CX….

Filiz Garip (@ProfFilizGarip) wrote a thoughtful commentary on our paper: What failure to predict life outcomes can teach us doi.org/10.1073/pnas.2…

The Socius special collection includes 12 papers by participants describing their approaches to the Challenge, 3 papers by our group that will be helpful to researchers creating other mass collaborations, and 1 comment.

Salganik, Lundberg, Kindel, and McLanahan. “Introduction to the Special Collection on the Fragile Families Challenge.” @msalganik @IanLundberg1 @alextkindel doi.org/10.1177/237802…

Ahearn and Brand. “Predicting Layoff among Fragile Families.” @JennieBrand1 doi.org/10.1177%2F2378…

Altschul. "Leveraging Multiple Machine Learning Techniques to Predict Major Life Outcomes from a Small Set of Psychological and Socioeconomic Variables: A Combined Bottom-Up/Top-Down Approach." @dremalt doi.org/10.1177%2F2378…

Carnegie and Wu. "Variable Selection and Parameter Tuning for BART Modeling in the Fragile Families Challenge." doi.org/10.1177%2F2378…

Compton. "A Data-Driven Approach to the Fragile Families Challenge: Prediction through Principal Components Analysis and Random Forests." doi.org/10.1177%2F2378…

Davidson. "Black-Box Models and Sociological Explanations: Predicting High School GPA Using Neural Networks." @thomasrdavidson doi.org/10.1177%2F2378…

Filippova, Gilroy, Kashyap, Kirchner, Morgan, Polimis, Usmani, and Wang. "Humans in the Loop: Incorporating Expert and Crowdsourced Knowledge for Predictions Using Social Survey Data." @anna_fil @ccgilroy @ridhikash07 @alliecmorgan @kpolimis doi.org/10.1177%2F2378…

Goode, Datta, and Ramakrishnan. "Imputing Data for the Fragile Families Challenge: Identifying Similar Survey Questions with Semi-automated Methods." @devDdata @profnaren @VT_DAC doi.org/10.1177%2F2378…

McKay. "When 4 ≈ 10,000: The Power of Social Science Knowledge in Predictive Performance." @SocialPolicy doi.org/10.1177%2F2378…

Raes. "Predicting GPA at Age 15 in the Fragile Families and Child Wellbeing Study." @TiUEconomics doi.org/10.1177%2F2378…

Rigobon, Jahani, Suhara, Al-Ghoneim, Alghunaim, Pentland, and Almaatouq. "Winning Models for GPA, Grit, and Layoff in the Fragile Families Challenge." @eamanjahani @suhara @khazgh @azizkag @alex_pentland @amaatouq doi.org/10.1177%2F2378…

Roberts. "Friend Request Pending: A Comparative Assessment of Engineering and Social Science Inspired Approaches to Analyzing Complex Birth Cohort Survey Data." doi.org/10.1177%2F2378…

Stanescu, Wang, and Yamauchi. "Using LASSO to Assist Imputation and Predict Child Wellbeing." @EHWpolisci doi.org/10.1177%2F2378…

Kindel, Bansal, Catena, Hartshorne, Jaeger, Koffman, McLanahan, Phillips, Rouhani, Vinh, and Salganik. "Improving Metadata Infrastructure for Complex Surveys: Insights from the Fragile Families Challenge." doi.org/10.1177%2F2378…

Fisher. “Data-specific Functions: A Comment on Kindel et al.” @jacob_c_fisher doi.org/10.1177%2F2378…

Liu and Salganik. “Successes and Struggles with Computational Reproducibility: Lessons from the Fragile Families Challenge.” @dayvidliu @msalganik doi.org/10.1177%2F2378…

Lundberg, Narayanan, Levy, and Salganik. "Privacy, Ethics, and Data Access: A Case Study of the Fragile Families Challenge." @IanLundberg1 @random_walker @karen_ec_levy @msalganik doi.org/10.1177%2F2378…

To promote computational reproducibility, there are Docker images for the Socius papers (see Liu and Salganik 2019) @dayvidliu @msalganik: hub.docker.com/r/2018dliu/fra…

The Fragile Families Challenge was supported by grants from the Russell Sage Foundation, NSF, and NICHD. @RussellSageFdn @NSF @NICHD_NIH

The Fragile Families Challenge builds on more than 20 years of work on the Fragile Families and Child Wellbeing Study, which was supported by grants from NICHD and a consortium of private foundations, including the Robert Wood Johnson Foundation. @ffcws

We are grateful to the Fragile Families Challenge Board of Advisers. fragilefamilieschallenge.org/#about

Thank you to everyone who participated in the Fragile Families Challenge!
