Assume time-to-event endpoint, alpha = 0.05, power = 80%, hazard ratio to detect 0.75.
Number of events needed for a single-stage trial: 380. In a single-stage trial you wait for this number of events IN ANY CASE, i.e. even if your initial guess of HR = 0.75 was off.
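The 380 can be reproduced with Schoenfeld's formula. A sketch, assuming 1:1 randomization and a two-sided test (the thread does not state these settings explicitly):

```python
from math import ceil, log
from statistics import NormalDist

# Schoenfeld's approximation for the required number of events,
# assuming 1:1 randomization and a two-sided test.
def schoenfeld_events(hr, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96
    z_b = NormalDist().inv_cdf(power)          # 0.84
    return 4 * (z_a + z_b) ** 2 / log(hr) ** 2

print(ceil(schoenfeld_events(0.75)))  # 380
```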
Assume you add a futility interim (stop the trial if the observed HR ≥ 1, i.e. no sign of benefit) after 30% of events and an efficacy interim (O'Brien-Fleming alpha-spending) after 66% of events. This increases the maximal number of events needed from 380 to 408.
Interims are performed after 123 and 270 events.
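The interim timings follow directly from the information fractions; rounding up to whole events is my assumption here (design software may round differently):

```python
from math import ceil

max_events = 408  # maximal events after group-sequential inflation of 380
# Interim analyses at 30% and 66% of the maximal information:
print([ceil(f * max_events) for f in (0.30, 0.66)])  # [123, 270]
```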
Now if we run 100 such trials, some of them will actually stop at the 1st or 2nd interim. The probabilities of that happening, under H0 / H1 respectively, are:
futility: 0.50 / 0.06
efficacy: 0.006 / 0.43
So e.g. if the drug is useless, half of all trials will stop at the futility interim.
Stopping at an interim of course means we need to collect far fewer events, so the *expected* number of events is much lower.
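A quick sketch of the expected event counts, assuming (my assumption) that the stopping probabilities above are marginal and that trials not stopping early run to the maximal 408 events:

```python
# Event counts at the two interims and the final analysis.
events = (123, 270, 408)

# Expected events = sum over analyses of (events at analysis) x
# (probability the trial ends at that analysis), treating the
# stopping probabilities as marginal.
def expected_events(p_fut, p_eff):
    probs = (p_fut, p_eff, 1 - p_fut - p_eff)
    return sum(n * p for n, p in zip(events, probs))

print(round(expected_events(0.50, 0.006)))  # under H0: ~265
print(round(expected_events(0.06, 0.43)))   # under H1: ~332
```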
So in both cases the expected number of events is *much less* than the 380 we need to collect in any case in a single-stage design. That is the main advantage of such designs.
2) How does the effect we power at, 0.75, relate to the effect size needed to stop for efficacy?
At the efficacy interim, to stop early the p-value must be ≤ 0.012; for the trial to be significant at the final analysis it must be ≤ 0.046. These significance levels correspond to hazard ratios of 0.735 and 0.821, respectively.
The latter are sometimes called minimal detectable differences (MDDs).
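These MDDs can be back-calculated from the significance levels. A sketch, using the standard approximation SE(log HR) ≈ 2/√events for 1:1 randomization (my assumption) and two-sided levels:

```python
from math import exp, sqrt
from statistics import NormalDist

# Hazard ratio just reaching two-sided significance level p
# with the given number of events, via SE(log HR) ~ 2/sqrt(events).
def mdd(p, events):
    z = NormalDist().inv_cdf(1 - p / 2)
    return exp(-z * 2 / sqrt(events))

print(round(mdd(0.012, 270), 3))  # interim: ~0.737 (thread quotes 0.735)
print(round(mdd(0.046, 408), 3))  # final:   ~0.821
```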
Often, people believe that in order to stop a trial early, the effect seen at the interim must be *much larger* than what we assumed for powering. Comparing 0.75 to 0.735, it is clear that this is not the case. That the MDD at the interim and the effect we power at are about the same is typical for an OBF-type boundary with an interim after about 2/3 of the information.
Another common belief is that in order to be significant at the final analysis we need to observe a hazard ratio ≤ 0.75. Again, not true: the MDD at the final analysis is actually 0.821, i.e. that is the hazard ratio we need to beat to get a p-value of 0.046 or lower.
3) "Cheating"? The methodology for group-sequential designs is developed such that the familywise error rate across *all looks at the data* is controlled. This is why at the final analysis the p-value needs to be ≤ 0.046, not ≤ 0.05.
This is the price to pay for the interim look.
But why is 0.012 + 0.046 > 0.05 allowed? Isn't that cheating? No, because by exploiting the correlation between the test statistics at the interim and the final analysis you can "gain" a bit of alpha. Again, no cheating: the FWER is always protected.
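This can be checked by simulation. The sketch below assumes two-sided tests and the standard correlation √(information fraction) = √(270/408) between the interim and final standardized test statistics; the actual design uses exact multivariate normal computations rather than Monte Carlo:

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(42)
rho = sqrt(270 / 408)  # correlation between interim and final statistics
nd = NormalDist()
# Two-sided critical values for levels 0.012 (interim) and 0.046 (final).
z1, z2 = nd.inv_cdf(1 - 0.012 / 2), nd.inv_cdf(1 - 0.046 / 2)

n, rejections = 200_000, 0
for _ in range(n):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    s1 = x                                # interim statistic under H0
    s2 = rho * x + sqrt(1 - rho**2) * y   # final statistic, corr(s1, s2) = rho
    if abs(s1) >= z1 or abs(s2) >= z2:    # reject at either look
        rejections += 1

print(rejections / n)  # ~0.05: FWER is held despite 0.012 + 0.046 > 0.05
```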
4) If stopping early the effect estimate may be biased. Is this an issue?
A lot has been written about inference adjusted for the fact that a trial stopped early. I'd just like to give a median-unbiased estimate of the hazard ratio in our example. Assume at the futility interim we observe HR = 0.69, and at the efficacy interim HR = 0.66 with a *conventional* 95% CI from 0.51 to 0.85. Since 0.66 ≤ 0.735, the trial stops for efficacy.
The median-unbiased estimate accounting for early stopping is 0.68, with an adjusted CI from 0.53 to 0.86. So the conventional and adjusted analyses are close.
5) What happens operationally if a trial stops early? Statistically, we "stopped" the trial at the efficacy interim and rejected H0: hazard ratio = 1 under full type I error control. We would thus proceed with filing the drug.
But of course, operationally the trial would continue: more follow-up data would be collected on primary and secondary endpoints (e.g. OS); safety and biomarker data collection would also continue, typically for years.
Also, one would often still do an analysis at 408 events, the initially planned final analysis, to make sure results persist over time.
Note that stopping at the efficacy interim typically leads to unblinding, so the analysis and interpretation of follow-up data need caution and expertise.
So group-sequential designs reduce the expected number of events needed and provide valid inference under reasonable assumptions. All this within the framework of hypothesis testing, as required by Health Authority guidelines.
I hope this thread is useful. Comments welcome!
The end.