10 years of replication and reform in psychology. What has been done and learned?
Our latest paper, prepared for the Annual Review, summarizes advances in conducting and understanding replication and the reform movement that has grown up around it.
We open w/ an anecdote about the 2014 special issue of Social Psychology. The event encapsulated themes that played out over the decade: the issue brought attention to replications and Registered Reports, & spawned “repligate”.
We consolidate evidence from replication efforts. Across 307 replications, 64% reported statistically significant effects in the same direction as the original, with effect sizes 68% as large as the originals. The median replication had 2.8x the sample size of the original; the mean was 15.1x. (A sketch of how such summaries are computed follows the figure notes below.)
Figure notes: values below the zero line indicate replication effect sizes smaller than the originals; open circles are non-significant replications; the x-axis is the original effect size.
The first 3 panels are systematic replications; the 4th panel is 77 multisite replications; the 5th is a prospective “best practice” replication.
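For readers who want the operational definitions behind those summary numbers, here is a minimal sketch, using made-up effect sizes rather than the review's actual dataset, of how "significant in the same direction" and "relative effect size" summaries are typically computed from paired original/replication results.

```python
# Minimal sketch with hypothetical numbers (not the review's actual data):
# summarizing paired original/replication results.
import pandas as pd

pairs = pd.DataFrame({
    "es_original":    [0.50, 0.30, 0.62, 0.18],   # hypothetical effect sizes
    "es_replication": [0.34, 0.02, 0.45, 0.20],
    "p_replication":  [0.010, 0.410, 0.003, 0.040],
})

# Share of replications that are significant and in the original's direction
same_direction_sig = ((pairs.p_replication < 0.05) &
                      (pairs.es_replication * pairs.es_original > 0)).mean()

# Average replication effect size relative to the original
relative_es = (pairs.es_replication / pairs.es_original).mean()

print(f"Significant in the same direction: {same_direction_sig:.0%}")
print(f"Mean relative effect size: {relative_es:.2f}")
```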
We examine whether there is evidence of self-correction after failures to replicate. So far, not much, judging by how original studies are cited in the few years following a prominent failure to replicate. More attention to this question is needed; see osf.io/preprints/meta…
We review the role of theory, features of original studies, and features of replication studies in determining what replicates and what does not. Weak, non-specific theories don’t help. Neither do poor methods and small samples in original studies.
Nor would incompetence or lack of fidelity in replication studies. However, the existing evidence about the quality of replication studies does not support these as explanations for the high failure rates observed in systematic and multisite replications.
We review evidence that replication outcomes can be predicted in advance by surveys, prediction markets, structured elicitations, and machine learning. This evidence is the basis for developing tools that assess credibility automatically to help guide attention and decision-making.
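To make the machine-learning piece concrete, here is a minimal sketch with toy data and hypothetical features, not any published model, of predicting replication outcomes from simple study characteristics.

```python
# Minimal sketch (toy data, hypothetical features; not any published model):
# predicting whether a finding will replicate from simple study features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical features: original p-value, original sample size, original effect size
X = np.array([
    [0.049,  40, 0.45],
    [0.001, 220, 0.60],
    [0.030,  55, 0.35],
    [0.004, 180, 0.50],
    [0.045,  30, 0.70],
    [0.002, 300, 0.40],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = replicated, 0 = did not replicate

model = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(model, X, y, cv=3).mean()
print(f"Cross-validated accuracy: {accuracy:.2f}")
```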
We review evidence about what could improve replicability and discuss what is optimal: “Low replicability is partly a symptom of tolerance for risky predictions and partly a symptom of poor research practices. Persistent low replicability is a symptom of poor research practices.”
We also discuss evidence, such as Protzko et al. (2020), that high replicability is achievable. Figure from psyarxiv.com/n2a9x/
We review the cultural, structural, social, and individual impediments to improving replicability: reward systems demanding novelty and discouraging replication, social challenges navigating hostility and reputational stakes, & confirmation, hindsight, & outcome reasoning biases.
We summarize the reform movement and its progress, including the decentralized strategy for culture change. For example, this figure shows adoption of transparency policies (TOP) by a random sample of psychology journals and a selection of high-impact ones.
This Figure illustrates the integration of the many services, communities, and stakeholders that are individually and collectively contributing to scaling up behavior and culture change toward more transparency and rigor.
We close by highlighting the metascience community in psychology that has turned theory & methodology to examine itself. With all the progress made, its greatest contribution might be having identified even more of what we do not yet understand. psyarxiv.com/ksfvq/
• • •
Happy to elaborate. Think of preregistration of analysis plans as hypothesizing, data analysis, and scenario planning all rolled into one, but without knowing what the data are. This creates a novel decision-making situation. 1/
For example, many people report being shocked at how hard it is the first time they preregister an analysis plan without seeing the data. It produces a recognition that our analysis decision-making (and hypothesizing) had been much more data-contingent than we realized. 2/
Without the data, there is a lot of new mental work to articulate precisely what the hypothesis is and how the data could be used to evaluate it. My odd experience was believing that I had been doing that all along, without realizing how much discretion I was using. 3/
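One way to see what that pre-data work looks like in practice is to write the confirmatory analysis as code before any data exist. The sketch below is purely illustrative, with hypothetical variable names and thresholds; it is not a template from the paper.

```python
# Purely illustrative sketch of a preregistered analysis plan written as code
# before data collection; variable names and thresholds are hypothetical.
import pandas as pd
from scipy import stats

ALPHA = 0.005                      # pre-specified significance threshold
MIN_RT_MS, MAX_RT_MS = 200, 3000   # pre-specified trial exclusion rule

def confirmatory_analysis(df: pd.DataFrame) -> dict:
    """Pre-specified test: treatment speeds responses relative to control."""
    clean = df[(df.rt_ms > MIN_RT_MS) & (df.rt_ms < MAX_RT_MS)]
    treatment = clean.loc[clean.condition == "treatment", "rt_ms"]
    control = clean.loc[clean.condition == "control", "rt_ms"]
    t, p = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t
    return {"t": t, "p": p, "supports_hypothesis": (p < ALPHA) and (t < 0)}
```

Every analytic choice a reader could make after seeing the data (exclusions, test, threshold, direction) is fixed in advance, which is exactly the discipline described above.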
Some predictions about whether a researcher's ideology affects their likelihood of replicating a prior result. ht @jayvanbavel
First, I have no doubt that ideology CAN influence replicability. Classic Rosenthal work + more provides a good basis.
So, under what conditions?
1. Ideology may guide the selection of studies to replicate: more likely to pursue implausible X because it disagrees with my priors, and plausible Y because it agrees with them.
On balance, this may be a benefit of ideology, helping with self-correction of dubious findings and bolstering of plausible ones.
2. Ideology may shape the design of studies: more likely to select design conditions that will fail if I don't like the idea, and conditions that will succeed if I do.
This is a problem because of the tendency to overgeneralize from limited conditions to the whole phenomenon. But,
ML2 minimized boring reasons for failure. First, using original materials & Registered Reports (cos.io/rr), all 28 replications met expert-reviewed quality control standards. Failures to replicate are not easily dismissed as replication incompetence. psyarxiv.com/9654g
Second, the total ML2 replication median sample size (n = 7157) was 64x the original median sample size (n = 112). If there was an effect to detect, even a much smaller one, we would detect it. The resulting estimates have very high precision. psyarxiv.com/9654g
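Back-of-envelope arithmetic (mine, not quoted from the paper) on why that sample-size ratio translates into precision:

```python
# Standard errors shrink with the square root of sample size, so a ~64x
# larger sample yields roughly 8x narrower confidence intervals.
import math

n_original, n_replication = 112, 7157
print(f"Approximate precision gain: {math.sqrt(n_replication / n_original):.1f}x")  # ~8x
```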
We replicated 21 social science experiments published in Science or Nature and succeeded with 13. Replication effect sizes were about half those of the originals. All materials, data, code, & reports: osf.io/pfdyw/; preprint socarxiv.org/4hmb6/; Nature Human Behaviour nature.com/articles/s4156…
Using prediction markets, we found that researchers were very accurate in predicting which studies would replicate and which would not. (blue = successful replications; yellow = failed replications; x-axis = market closing price) socarxiv.org/4hmb6/ nature.com/articles/s4156… #SSRP
The design ensured 90% power to detect an effect size half as large as the original study's. Replications averaged 5x the sample size of the originals. We obtained original materials in all but one case, and original authors provided very helpful feedback on the designs. socarxiv.org/4hmb6/
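For concreteness, here is a minimal sketch of that kind of power analysis, assuming a two-sample t-test and a hypothetical original effect size (the actual SSRP calculations were study-specific).

```python
# Minimal sketch of a power analysis like the one described above, assuming a
# two-sample t-test and a hypothetical original effect size of d = 0.60.
from statsmodels.stats.power import TTestIndPower

original_d = 0.60              # hypothetical original standardized effect size
target_d = original_d / 2      # power the replication for half the original effect

n_per_group = TTestIndPower().solve_power(effect_size=target_d,
                                          power=0.90,
                                          alpha=0.05,
                                          alternative="two-sided")
print(f"Required n per group: {n_per_group:.0f}")  # required sample size per group at d = 0.30
```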