10 years of replication and reform in psychology. What has been done and learned?

Our latest paper, prepared for the Annual Review, summarizes the advances in conducting and understanding replication and the reform movement that has sprung up around it.

psyarxiv.com/ksfvq/

1/
We open w/ anecdote of the 2014 special issue of Social Psychology. The event encapsulated themes that played out over the decade. The issue brought attention to replications, Registered Reports, & spawned “repligate”

econtent.hogrefe.com/toc/zsp/45/3

Figure from royalsocietypublishing.org/doi/full/10.10…
We consolidate evidence from replication efforts. Across 307 replications, 64% reported statistically significant effects in the same direction as the originals, with effect sizes 68% as large as the originals. The median replication was 2.8x the sample size of the original, the mean was 15.1x.
Values below the zero line are replications with smaller effect sizes than the original studies. Open circles are non-significant replications. The x-axis is the effect size of the original.

1st 3 panels are systematic replications; 4th panel is 77 multisite replications; 5th panel is a prospective “best practice” replication.
We examine whether there is evidence for self-correcting after failures to replicate. So far, not much, when considering how original studies are cited in the few years following a prominent failure to replicate. More attention needed to this question, see osf.io/preprints/meta…
We review the role of theory, features of original studies, and features of replication studies in determining what replicates and what does not. Weak, non-specific theories don’t help. Neither do poor methods and small samples in original studies.
Neither would incompetence and lack of fidelity in replication studies. However, the existing evidence about the quality of the replication studies does not support these as explanations for the high rates of failure to replicate in systematic and multisite replications.
We review evidence that replication outcomes can be predicted in advance by surveys, prediction markets, structured elicitations, and machine learning. This evidence is the basis of developing tools to assess credibility automatically to help guide attention and decision-making.
We review evidence about what could improve replicability and discuss what is optimal: “Low replicability is partly a symptom of tolerance for risky predictions and partly a symptom of poor research practices. Persistent low replicability is a symptom of poor research practices.”
We also discuss evidence such as Protzko et al (2020) that high replicability is achievable. Figure from psyarxiv.com/n2a9x/
We review the cultural, structural, social, and individual impediments to improving replicability: reward systems demanding novelty and discouraging replication, social challenges navigating hostility and reputational stakes, & confirmation, hindsight, & outcome reasoning biases.
We summarize the reform movement and its progress including the decentralized strategy for culture change. For example, this Figure shows adoption of transparency policies (TOP) by a random sample and a selection of high-impact psychology journals.
This Figure illustrates the integration of the many services, communities, and stakeholders that are individually and collectively contributing to scaling up behavior and culture change toward more transparency and rigor.
We close by highlighting the metascience community in psychology that has turned theory & methodology to examine itself. With all the progress made, its greatest contribution might be having identified even more of what we do not yet understand. psyarxiv.com/ksfvq/

• • •

More from @BrianNosek

10 Sep 20
Our prospective replication study released!

5 years: 16 novel discoveries get round-robin replication.

Preregistration, large samples, transparency of materials.

Replication effect sizes 97% the size of confirmatory tests!

psyarxiv.com/n2a9x

Lead: @JProtzko 1/
When teams made a new discovery, they submitted it to a prereg’d confirmatory test (orange).

Confirmatory tests subjected to 4 replications (Ns ~ 1500 each)

Original team wrote full methods section. Team conducted independent replications (green) and a self-replication (blue).
Based on confirmatory effect sizes and replication sample sizes, we’d expect 80% successful replications (p<.05). We observed 86%.

Exceeding possible replication rate based on power surely due to chance. But, outcome clearly indicates that high replicability is achievable
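A rough sketch of how an expected replication rate can be derived from power, for readers who want to see the arithmetic. It assumes a simple two-group design, a standardized mean difference `d` as the effect size, and a normal approximation to the test statistic; none of that is stated in the thread, and the effect sizes in the usage lines are hypothetical placeholders, not values from the study. The function names are made up for illustration.

```python
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is a standardized mean difference (assumed effect size measure);
    n_per_group is the sample size in each of the two groups.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = abs(d) * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    upper = 1 - _STD_NORMAL.cdf(z_crit - noncentrality)
    lower = _STD_NORMAL.cdf(-z_crit - noncentrality)
    return upper + lower


def expected_replication_rate(effect_sizes, n_per_group, alpha=0.05):
    """Average power across studies, i.e. the expected share with p < alpha
    if the confirmatory effect sizes were the true effects."""
    return mean(two_sample_power(d, n_per_group, alpha) for d in effect_sizes)


if __name__ == "__main__":
    # Hypothetical effect sizes, NOT the values from the study.
    hypothetical_effects = [0.05, 0.10, 0.15, 0.20, 0.30]
    # 750 per group corresponds to a total N of about 1500.
    rate = expected_replication_rate(hypothetical_effects, n_per_group=750)
    print(f"Expected replication rate: {rate:.2f}")
```

The point of the sketch: the expected rate is an average of per-study power, so a handful of small confirmatory effects can pull it well below 100% even with Ns around 1500.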
25 May 19
Happy to elaborate. Think of preregistration of analysis plans as hypothesizing, data analysis, and scenario planning all rolled into one and without knowing what the data are. This creates a novel decision-making situation. 1/
For example, the first time preregistering an analysis plan, many people report being shocked at how hard it is without seeing the data. It produces a recognition that our analysis decision-making (and hypothesizing) had been much more data contingent than we realized. 2/
Without the data, there is a lot of new mental work to articulate precisely what the hypothesis is and how the data could be used to evaluate that hypothesis. My odd experience was believing that I had been doing that all along, w/out realizing that I used so much discretion. 3/
8 Jan 19
Some predictions about whether a researcher's ideology affects their likelihood of replicating a prior result. ht @jayvanbavel

First, I have no doubt that ideology CAN influence replicability. Classic Rosenthal work + more provides good basis.

So, under what conditions?
1. Ideology may guide selection of studies to replicate. More likely to pursue implausible X because it disagrees with my priors; and pursue plausible Y because it agrees with my priors.

On balance, this may be a benefit of ideology to help with self-correction and bolstering.
2. Ideology may shape design of studies. More likely to select design conditions to fail if I don't like the idea; more likely to select design to succeed if I like the idea.

This is a problem because of tendency for overgeneralization of limited conditions to phenomenon. But,
19 Nov 18
Many Labs 2: 28 findings, 60+ samples, ~7000 participants each study, 186 authors, 36 nations.

Successfully replicated 14 of 28 psyarxiv.com/9654g

ML2 may be more important than Reproducibility Project: Psychology. Here’s why...

@michevianello @fredhasselman @raklein3
ML2 minimized boring reasons for failure. First, using original materials & Registered Reports cos.io/rr all 28 replications met expert reviewed quality control standards. Failure to replicate not easily dismissed as replication incompetence. psyarxiv.com/9654g
Second, the total ML2 replication median sample size (n = 7157) was 64x original median sample size (n = 112). If there was an effect to detect, even a much smaller one, we would detect it. Ultimate estimates have very high precision. psyarxiv.com/9654g
27 Aug 18
We replicated 21 social science experiments in Science or Nature. We succeeded with 13. Replication effect sizes were half of originals. All materials, data, code, & reports: osf.io/pfdyw/, preprint socarxiv.org/4hmb6/, Nature Human Behaviour nature.com/articles/s4156…
Using prediction markets we found that researchers were very accurate in predicting which studies would replicate and which would not. (blue=successful replications; yellow=failed replications; x-axis=market closing price) socarxiv.org/4hmb6/ nature.com/articles/s4156… #SSRP
Design ensured 90% power to detect an effect size half as large as original study. Replications averaged 5x the sample size of originals. We obtained original materials in all but one case, and original authors provided very helpful feedback on design. socarxiv.org/4hmb6/
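To illustrate why powering for half the original effect size pushes replication samples up so sharply, here is a minimal sketch of the standard normal-approximation sample size formula for a two-group comparison. The two-group design, the standardized mean difference `d`, and the example value `0.4` are assumptions for illustration only; the thread does not say how the actual power analyses were run, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def n_per_group_for_power(d, power=0.90, alpha=0.05):
    """Per-group n for a two-sided, two-sample z-test to reach the target power.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2, rounded up.
    d is a standardized mean difference (assumed effect size measure).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    original_d = 0.4  # hypothetical original effect size, not from the study
    n_full = n_per_group_for_power(original_d)
    n_half = n_per_group_for_power(original_d / 2)
    print(f"n per group for d = {original_d}: {n_full}")
    print(f"n per group for d = {original_d / 2}: {n_half}")
```

Because `d` enters the formula squared, halving the target effect size roughly quadruples the required sample.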
