Brief (well, maybe...) non-technical thread attempting to explain “alpha-spending” at interim analyses for the layperson
Not-uncommon question: why can’t we look at the data a bunch of times during a trial and simply stop whenever p<0.05? After all, the evidence supporting an effect is now “statistically significant,” right?
Bayesians, please, hold comments until the end. One day, perhaps, there will be more published trials using Bayesian interim-monitoring approaches, but for now, I work with students / trainees / faculty who need help reading / understanding more conventional frequentist trials
Anyways, if you’ve ever wondered “Why can’t we keep looking and stop as soon as we see p<0.05?” – perhaps this little toy simulation will help.
CONSORT: “Trialists who perform unplanned interim analyses without formal stopping rules run a high risk of catching the data at a random extreme, which may represent an overestimate of the treatment benefit or harm, and compromise the validity of their findings....”
Yet, this doesn’t always resonate, especially with the emotional appeal of “but the data are clearly trending towards benefit…isn’t it unethical to keep randomizing when we already have a ‘significant’ result?” or “Why does it matter that we looked at the data before?”
Here’s a quick simulation that hopefully will be useful. I wrote a macro in SAS to generate what follows. I’m happy to share the code on request; it’s admittedly a little clunky and I’m sure an #rstats wizard could do it better / faster / cheaper, but whatever, it gets the job done.
We start by creating a dataset of 1,000 patients (half getting “Treatment A” and half “Treatment B”) with random draws from a Bernoulli distribution with p=0.3 (meaning a “30% chance” of the outcome; suppose this is a short-term binary outcome like “in-hospital death”)
Notice: there is absolutely no “real” effect of treatment in the dataset. Every individual patient is a random observation with 30% chance of death. Their treatment “assignment” has no influence on their outcome.
In each simulation, we perform an “interim analysis” every 200 patients (meaning looks after the outcome is known for the first 200, 400, 600, 800, and then all 1,000 patients) and save the p-value from each look.
I ran this 1,000 times (each simulation representing one RCT with 1,000 patients) and saved the p-values from each of the would-be interim looks (100,000 simulations would be better, but I don’t have all day)
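Since I can’t paste a SAS macro into a tweet, here’s a rough Python sketch of the same idea (my reconstruction, not the original macro; the chi-square test at each look, the alternating treatment assignment, and names like run_trials are my assumptions):

```python
import numpy as np
from scipy.stats import chi2_contingency

def run_trials(threshold, n_sims=1000, looks=(200, 400, 600, 800, 1000),
               p_event=0.3, seed=42):
    """Simulate null trials (zero treatment effect); each trial 'stops'
    the first time the look-wise p-value crosses the threshold."""
    rng = np.random.default_rng(seed)
    n_max = max(looks)
    first_hit = {n: 0 for n in looks}  # trials first crossing at each look
    for _ in range(n_sims):
        treat = np.tile([0, 1], n_max // 2)        # alternate A/B assignment
        outcome = rng.binomial(1, p_event, n_max)  # 30% event risk, ignores arm
        for n in looks:
            t, y = treat[:n], outcome[:n]
            # 2x2 table of arm x outcome at this look
            table = np.array([[np.sum((t == a) & (y == b)) for b in (0, 1)]
                              for a in (0, 1)])
            _, p, _, _ = chi2_contingency(table, correction=False)
            if p < threshold:
                first_hit[n] += 1
                break  # stop early (or be "positive" at the final look)
    return first_hit

hits = run_trials(threshold=0.05)
for n, k in hits.items():
    print(f"first p<0.05 at n={n}: {k} trials")
print(f"overall: {sum(hits.values())}/1000 null trials with p<0.05")
```

(Your counts will differ a bit from mine below; different software, different random draws.)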
51 of the 1,000 trials had p<0.05 at the first interim analysis (200 patients). Suppose that they were stopped at this time, and the remaining 949 trials continued enrolling.
Of the remaining 949 trials, another 28 had p<0.05 at the second interim (400 patients).

Of the remaining 921 trials, another 23 had p<0.05 at the third interim (600 patients).
Of the remaining 898 trials, another 28 had p<0.05 at the fourth interim (800 patients).

Of the remaining 870 trials, another 12 had p<0.05 at the final analysis (1,000 patients).
That’s 142 of 1,000 simulated RCTs (14.2% of the simulations) which would have shown p<0.05 at one of the interims or the final analysis – with a treatment that has zero effect (remember, the treatment groups are all independent random draws with 30% risk of death).
The whole point of using NHST with alpha=0.05 is to limit to 5 percent the probability of concluding “efficacy” for a therapy that has zero benefit. Whether we should be using p<0.05 (or NHST at all) is a bit controversial, as you may have heard…topic for a different time.
But anyways, the point is, taking a bunch of looks with p<0.05 (without any spending function or penalty for the interim looks) inflates the risk of catching the data at a "random extreme" even when there is no true treatment effect.
My little simulation had 14.2% of trials that would have concluded “efficacy” - in simulated data with zero treatment effect. Heck, 130 of the simulated trials (13%) would have stopped early under the “look a bunch of times and stop when p<0.05” approach.
“If you look at your data after every 100 new data points and then decide to go further or to stop, you should compare those data to a sampling distribution from countless hypothetical samples drawn in that specific way. You can’t just use default p-values”
That’s the answer to “why do we care about the alpha-spend at the final analysis?” – by introducing multiple looks into the decision process, we changed the sampling process itself, and default p-values no longer have the same meaning.
Anyways, hopefully this is a useful demonstration of problems with using the <0.05 threshold to test for efficacy at a bunch of interim looks and the final analysis without any “spending” function that accounts for multiple looks.
If you’re wondering what would have happened with no interims/early stopping in my simulation: only 64 of the trials (6.4% of the simulations) ended with p<0.05 at the final analysis with 1,000 patients

(if I had done more simulations, this number would gravitate toward 5%)
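If you want to check that piece with the sketch above, a final-look-only run is just one look at n=1,000 (again, expect numbers near mine, not identical):

```python
# Single analysis at n=1000, no interim looks
hits = run_trials(threshold=0.05, looks=(1000,))
print(f"final-look-only: {sum(hits.values())}/1000 null trials with p<0.05")
# Expect roughly 50/1000; the chi-square test is approximate for binary
# data, so the long-run rate sits near (not exactly at) 5%.
```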
We can use fancy math to calculate exact probabilities for each of the interim looks, but I hoped that the simulation would make it a little easier for people to grasp.
Point is: if you perform a bunch of interim looks and use the same threshold for stopping across the board, the entire purpose of an “alpha level” is compromised. That’s why alpha-spending is necessary for interim looks in trials analyzed in a frequentist statistical framework.
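To make that concrete with the same sketch: one classic fix is a Pocock-type boundary, which uses a constant, stricter nominal threshold at every look. For 5 equally spaced looks at overall alpha=0.05, the standard Pocock nominal p-value is about 0.0158 (tabled for normal test statistics, so it’s approximate for this chi-square-on-binary setup):

```python
# Same simulation, but "spending" alpha with a Pocock-style constant
# boundary: nominal p < 0.0158 at each of the 5 looks
# (standard Pocock value for K=5 looks, overall alpha = 0.05)
hits = run_trials(threshold=0.0158)
print(f"with Pocock boundary: {sum(hits.values())}/1000 null trials cross")
# The overall false-positive rate comes back down to roughly 5%,
# instead of the ~14% we saw with p < 0.05 at every look.
```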
Also, please note: in well-designed trials with multiple interim analyses, repeated looks with an unadjusted p<0.05 is NOT how interim analyses should actually be performed! It’s an illustrative example of why you SHOULDN’T do this. If you’re out there doing a trial…please don’t do this!