Fun thread using some simulations modeled on the ARREST trial design (presented @CritCareReviews a few months ago) to talk through some potential features you might see when we talk about “adaptive” trials
DISCLAIMER: this is not just a “frequentist” versus “Bayesian” thread. Yes, this trial used a Bayesian statistical approach, but there are frequentist options for interim analyses & adaptive features, and that’s a longer debate for another day.
DISCLAIMER 2: this is just a taste using one motivational example for discussion; please don’t draw total sweeping generalizations about “what adaptive trials do” from this thread, as the utility of each “feature” must always be carefully considered in that specific context
Anyways, this trial is pretty neat to walk through because it’s not too dreadfully hard to simulate and the design offers a chance to gently explore “how it works” for folks who are curious about adaptive trials.
The design works roughly like this: enroll the first 30 patients using 1:1 randomization, then perform an interim analysis based on the outcomes of those 30 patients.
If probability of superiority for ECMO > 0.986 after 30 patients, stop trial for efficacy (since this would suggest ECMO is so highly effective that it’s arguably unethical to continue randomizing patients not to receive it).
If probability of superiority of standard care > 0.986, stop the trial for harm (since this would suggest disastrous results with ECMO).
If neither of the above conditions are met, trial would continue and plan to enroll another 30 patients.
If the trial continues past the first interim, randomization for the next 30 participants is weighted in proportion to the posterior probability that each treatment is superior at the most recent analysis, restricted not to exceed 3:1 in either direction.
(I’ll explore this specific feature in some detail later in the thread, as “response-adaptive” randomization comes with a lot of tricky pros and cons to discuss…)
The next interim analysis would occur at 60 patients, and every 30 patients thereafter (n=90, n=120, n=150) up to a maximum of 150 patients.
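To make those rules concrete, here’s a minimal sketch of the decision logic at a single interim, assuming independent flat Beta(1,1) priors on each arm’s survival probability (my own illustration in R; the actual ARREST priors and analysis model may differ):

```r
# Posterior probability that survival with ECMO beats standard care,
# assuming independent flat Beta(1,1) priors on each arm (illustrative only).
prob_superiority <- function(surv_ecmo, n_ecmo, surv_std, n_std, n_draws = 1e5) {
  p_ecmo <- rbeta(n_draws, 1 + surv_ecmo, 1 + n_ecmo - surv_ecmo)
  p_std  <- rbeta(n_draws, 1 + surv_std,  1 + n_std  - surv_std)
  mean(p_ecmo > p_std)
}

# Decision rule applied at each interim analysis
interim_decision <- function(p_sup) {
  if (p_sup > 0.986)      "stop for efficacy"
  else if (p_sup < 0.014) "stop for harm"    # i.e. P(standard superior) > 0.986
  else                    "continue"
}

# ECMO allocation probability for the next cohort, capped at 3:1 either way
next_alloc <- function(p_sup) min(max(p_sup, 0.25), 0.75)
```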
So I think I’d say the two main “adaptive” features to be aware of are i) the flexible sample size and ii) the response-adaptive randomization probabilities that would change at each interim analysis.
The trial team powered the study based on an assumed 12% survival probability in the standard care group, and concluded that the trial had 90% power to detect a benefit if the probability of survival was increased to 37% in the ECMO group.
They also established that the overall Type I error of this design was controlled at 5% if the true survival probability was 12% in both groups (i.e. no treatment benefit).
I’d like to walk through some simulated scenarios of the adaptive design versus a “conventional” design (e.g. parallel-group trial of n=150 with 1:1 randomization throughout) to discuss what you (can) gain and what you (can) give up by going with an adaptive design.
First: I’ll do 1000 simulations of the exact design of ARREST and report the simulated trial outcomes under the same assumptions the authors made for their primary power calculation (12% survival probability with control, increased to 37% probability of survival with ECMO).
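Before the results, here’s roughly the per-trial logic, as a simplified sketch (flat Beta(1,1) priors, simple randomization within each 30-patient cohort, outcomes known immediately); the actual design details differ a bit, but it gives the flavor:

```r
# Simulate one ARREST-like adaptive trial (simplified sketch).
# Returns the stopping decision and the sample size at which the trial ended.
simulate_trial <- function(p_std = 0.12, p_ecmo = 0.37,
                           cohort = 30, max_n = 150, threshold = 0.986) {
  n_e <- s_e <- n_s <- s_s <- 0   # running totals: randomized / survived
  alloc_ecmo <- 0.5               # first cohort is randomized 1:1
  for (n_analyzed in seq(cohort, max_n, by = cohort)) {
    on_ecmo <- rbinom(cohort, 1, alloc_ecmo)           # arm assignments
    new_e <- sum(on_ecmo); new_s <- cohort - new_e
    s_e <- s_e + rbinom(1, new_e, p_ecmo); n_e <- n_e + new_e
    s_s <- s_s + rbinom(1, new_s, p_std);  n_s <- n_s + new_s

    # Posterior P(ECMO superior) under flat Beta(1,1) priors (Monte Carlo)
    p_sup <- mean(rbeta(1e4, 1 + s_e, 1 + n_e - s_e) >
                  rbeta(1e4, 1 + s_s, 1 + n_s - s_s))

    if (p_sup > threshold)     return(list(result = "efficacy", n = n_analyzed))
    if (p_sup < 1 - threshold) return(list(result = "harm",     n = n_analyzed))

    # Response-adaptive randomization for the next cohort, capped at 3:1
    alloc_ecmo <- min(max(p_sup, 0.25), 0.75)
  }
  list(result = "no conclusion", n = max_n)
}

# Operating characteristics come from repeating this many times, e.g.:
# table(replicate(1000, simulate_trial()$result))
```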
First analysis (n=30 patients; N=1000 trials): 119 trials stopped for efficacy, 881 trials continued.
Second analysis (n=60 patients; N=881 trials remaining): 267 trials stopped for efficacy, 614 continued.
Third analysis (n=90; N=614 trials remaining): 269 trials stopped for efficacy, 345 continued.
Fourth analysis (n=120; N=345 trials remaining): 162 trials stopped for efficacy, 183 continued.
Final analysis (n=150; N=183 trials remaining): 76 trials concluded efficacy, 107 trials failed to meet the efficacy threshold of prob(superiority)>0.986
OK, what did we learn from this set of simulations?
First: the overall power of the design to detect this treatment effect (12% survival with standard care; 37% survival with ECMO) is about 90%, since 893 of the 1000 simulated trials reached an efficacy conclusion (this estimate would stabilize further over a larger number of simulations).
Compare this to the power of a conventional parallel-group design with n=150 patients randomized 1:1 and the same assumed outcomes (37% survival with ECMO versus 12% survival with standard therapy), which is about 95%.
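(That ~95% figure is easy to sanity-check with a standard two-sample proportions calculation in base R; this is a back-of-the-envelope approximation, not the authors’ calculation.)

```r
# Approximate power for a fixed-size parallel-group trial: n = 150 total
# (75 per arm), 12% vs 37% survival, two-sided alpha = 0.05.
power.prop.test(n = 75, p1 = 0.12, p2 = 0.37, sig.level = 0.05)
# Reports power of roughly 0.95.
```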
The ARREST design gets you to about 90% power, so we did give up a little statistical power versus the conventional design, but the conventional design assumes we recruit all 150 patients with no provision for early stopping, which brings me to the second point…
The adaptive approach offers the opportunity to stop early, and we see that this would be highly likely if the therapy really is highly effective (notice that 817 of the 1000 simulations terminated before n=150 because the efficacy threshold was crossed).
In fact, under the assumed treatment effects, there was a >50% probability that the trial would terminate by the n=90 interim analysis (119 + 267 + 269 = 655 of the 1000 simulations), meaning that if the treatment is highly effective, there’s a reasonably high probability the trial stops much sooner than a conventional design would.
Third thing: not only does the trial have the ability to terminate more quickly if an efficacy threshold is crossed, but the response-adaptive randomization means that (in theory…) more people inside the trial are getting treated with the better therapy.
This sounds like such an obvious thing that some people wonder why we don’t just do it for everything. How can you argue against using the study data to inform what people should get? If the study is showing one is better than the other, isn’t it good to use that one more?
There are tradeoffs, though: the first is that (all other things being equal) you lose a little bit of statistical power with response-adaptive randomization (RAR) versus straight 1:1 randomization (even if things go well early and the “right” treatment does well from the start).
The second is that a little bit of bad luck early can steer you in the wrong direction.
For example, in one of the simulated trials 10/15 died in the control group versus 12/15 died in the ECMO group in the first cohort of 30.
With a posterior probability of ECMO superiority of about 29%, randomization in the second cohort becomes a 29% probability of assignment to ECMO and a 71% probability of assignment to control (within the 3:1 cap). What happened from there in this simulation?
In the second group of 30 patients, 22 were assigned to standard therapy (21 died) and 8 were assigned to ECMO (4 died).
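For concreteness, the allocation update behind those assignment numbers (the 0.29 is the posterior probability from the first interim; the min/max is the 3:1 cap):

```r
p_sup <- 0.29                              # posterior P(ECMO superior) at this interim
alloc_ecmo <- min(max(p_sup, 0.25), 0.75)  # 0.29, i.e. 29% of assignments to ECMO
alloc_std  <- 1 - alloc_ecmo               # 0.71, i.e. 71% to standard care
```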
So by the second interim, the results were starting to look better for ECMO (16/23 deaths with ECMO versus 31/37 with standard therapy) but at the cost of a lot of patients being treated with the ‘wrong’ thing based on some early noise.
Of course, this is a fairly unlikely sequence of events in expectation, but it’s a risk you must be aware of with RAR: early fluke results can tilt the allocation off course and take a little while to recover (or even result in the trial terminating early for futility).
In that simulation, the early results almost completely reversed later on: the trial concluded with a finding of efficacy at the n=120 interim analysis, but that early blip did result in a lot of people getting the “wrong” treatment in the second cohort of 30 patients.
PLEASE NOTE: I will not give a full treatise on RAR here. I’m happy to field some follow-up questions or send them on to true experts, but don’t want to spend too much longer on that specific point because the utility is variable depending on specific context.
Some considerations: how long should you wait before varying the randomization? Should you restrict the randomization probabilities so they can’t tilt too far? Is it more efficient for a multi-arm trial than a 2-arm trial? Ethically, is the greater duty to patients in the trial or those outside the trial?
Nonetheless, you can show that despite the potential downsides, in aggregate/expectation RAR tends to assign a greater proportion of the people in the trial to the better treatment.
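If you want to see both sides of that tradeoff for yourself, here is a toy comparison (not the ARREST design: flat priors, no early stopping, an arbitrary final threshold) that tracks the proportion of participants assigned to the better arm and how often a final probability-of-superiority threshold is crossed, with and without RAR:

```r
# Toy two-arm trial with a binary outcome, enrolled in cohorts of 30,
# comparing response-adaptive randomization (RAR) to fixed 1:1 allocation.
compare_designs <- function(p_ctrl = 0.12, p_trt = 0.37,
                            cohort = 30, n_cohorts = 5, use_rar = TRUE) {
  n_t <- s_t <- n_c <- s_c <- 0
  alloc_trt <- 0.5
  for (k in seq_len(n_cohorts)) {
    on_trt <- rbinom(cohort, 1, alloc_trt)
    n_t <- n_t + sum(on_trt);  n_c <- n_c + (cohort - sum(on_trt))
    s_t <- s_t + rbinom(1, sum(on_trt), p_trt)
    s_c <- s_c + rbinom(1, cohort - sum(on_trt), p_ctrl)
    if (use_rar) {   # update allocation from the running posterior, capped at 3:1
      p_sup <- mean(rbeta(5e3, 1 + s_t, 1 + n_t - s_t) >
                    rbeta(5e3, 1 + s_c, 1 + n_c - s_c))
      alloc_trt <- min(max(p_sup, 0.25), 0.75)
    }
  }
  p_sup_final <- mean(rbeta(1e4, 1 + s_t, 1 + n_t - s_t) >
                      rbeta(1e4, 1 + s_c, 1 + n_c - s_c))
  c(prop_on_trt = n_t / (n_t + n_c), win = p_sup_final > 0.986)
}

# rowMeans over many simulations gives the average proportion assigned to the
# better arm and how often the final threshold is crossed, under each scheme:
# rowMeans(replicate(2000, compare_designs(use_rar = TRUE)))
# rowMeans(replicate(2000, compare_designs(use_rar = FALSE)))
```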
OK. So this all sounds great. We have nearly the same statistical power in the ARREST design as we do in the conventional design but may be able to get there with fewer patients (and more of those randomized being treated with the better thing, if it actually works).
But it’s hard to get something without giving something up, and an oft-expressed concern about these more exotic trial designs is that they somehow lower the bar and make it easier to conclude ‘success’ for things that don’t work.
While (some!) Bayesians argue that Type I error is not interesting or necessary, most regulatory authorities reviewing Bayesian designs require a demonstration that the design preserves overall Type I error control.
Anyways: it turns out that 29 of my 1,000 simulations under the null (12% survival in both groups) concluded efficacy, a simulated Type I error of about 3%, so the design sits in the right general ballpark relative to its 5% target.
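That kind of check is just the same simulation re-run with no treatment effect, e.g. using the simulate_trial() sketch from earlier in the thread:

```r
# Type I error check: simulate under the null (12% survival in both arms)
# and count how often the design declares efficacy anyway.
null_results <- replicate(1000, simulate_trial(p_std = 0.12, p_ecmo = 0.12)$result)
mean(null_results == "efficacy")   # expect a few percent with this sketch; the
                                   # exact value depends on the prior/model details
```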
For those interested…what actually happened in the ARREST trial?
Well, actually none of this fancy adaptive randomization happened, because the trial terminated at the first interim with N=30: 6 survivors in the ECMO group versus just 1 in the standard group gave a posterior probability of superiority of 0.9861, exceeding the 0.986 threshold to stop the trial early.
DISCLAIMER: obviously I am not qualified to give the clinician’s perspective on the trial as a whole. There was some excellent discussion @CritCareReviews
(e.g. is this just reflective of great outcomes at one highly expert center? Generalizable to other settings? Can we possibly trust a trial that randomized only 30 patients? And so on…)
My intent in this thread was simply to explain some of the adaptive features employed in the proposed design, the different ways this trial can unfold, and the pros & cons of using such a design versus a more conventional parallel-group design.
Here is a little intro thread on how to do simulations of randomized controlled trials.
This thread will take a while to get all the way through & posted, so please be patient. Maybe wait a few minutes and then come back to it.
This can be quite useful if you’re trying to understand the operating characteristics (power, type I error probability, potential biases introduced by early stopping rules) of a particular trial design.
I will use R for this thread. It is free. I am not interested in debates about your favorite stats program at this time.
If you want to do it in something else, the *process* can still be educational; you’ll just have to learn to mimic this process in your preferred program.
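To give a flavor of where this is headed, here’s about the smallest useful version (made-up event rates, nothing from a real trial): simulate a two-arm trial with a binary outcome many times and see how often the test comes up significant.

```r
# Simulate one two-arm RCT with a binary outcome; return the p-value from a
# two-sample test of proportions (made-up event rates, 100 patients per arm).
sim_one <- function(n_per_arm = 100, p_control = 0.30, p_treat = 0.15) {
  events_c <- rbinom(1, n_per_arm, p_control)
  events_t <- rbinom(1, n_per_arm, p_treat)
  prop.test(c(events_t, events_c), c(n_per_arm, n_per_arm))$p.value
}

# Estimated power = the proportion of simulated trials with p < 0.05
mean(replicate(2000, sim_one()) < 0.05)
```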
Here’s a brief follow-up thread answering a sidebar question to the last 2 weeks’ threads on interim analyses in RCTs and stopping when an efficacy threshold is crossed.
The “TL;DR” summary of the previous lesson(s): yes, an RCT that stops early based on an efficacy threshold will tend to overestimate the treatment effect a bit, but that doesn’t actually mean the “trial is more likely to be a false positive result”
(Also, it seems that this is generally true for both frequentist and Bayesian analyses, though the prior may mitigate the degree to which this occurs in a Bayesian analysis)
As promised last week, here is a thread to explore and explain some beliefs about interim analyses and efficacy stopping in randomized controlled trials.
Brief explanation of motivation for this thread: many people learn (correctly) that randomized trials which stop early *for efficacy reasons* will tend to overestimate the magnitude of a treatment effect.
This sometimes gets mistakenly extended to believing that trials which stopped early for efficacy are more likely to be “false-positive” results, i.e. treatments that don’t actually work but just got lucky at an early interim analysis.
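One way to see the distinction is a small simulation sketch (arbitrary numbers, not tied to any particular trial): with a real treatment effect, the trials that happen to cross a strict interim boundary overestimate the effect on average, but that is a statement about the size of the estimate, not about how often truly null treatments sneak through, which you’d check separately by re-running with no true effect.

```r
# Two-arm trial, continuous outcome, true effect = 0.3 SD, one interim look
# at half the data with a strict p < 0.005 stopping rule (arbitrary choices).
one_trial <- function(n_per_arm = 100, true_effect = 0.3) {
  trt <- rnorm(n_per_arm, mean = true_effect)
  ctl <- rnorm(n_per_arm)
  half <- n_per_arm / 2
  if (t.test(trt[1:half], ctl[1:half])$p.value < 0.005)   # stop early for efficacy
    return(c(stopped_early = 1, estimate = mean(trt[1:half]) - mean(ctl[1:half])))
  c(stopped_early = 0, estimate = mean(trt) - mean(ctl))  # otherwise run to the end
}

res <- replicate(5000, one_trial())
mean(res["estimate", res["stopped_early", ] == 1])  # estimates among early stoppers:
                                                    # noticeably above the true 0.3
mean(res["stopped_early", ])                        # how often the boundary is crossed
# Re-running with true_effect = 0 shows the boundary is rarely crossed
# when there is no real effect.
```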
Having one of those mornings where you realize that it's sometimes a lot more work to be a good scientist/analyst than a bad one.
(Explanation coming...)
Processing some source data that could just be tabulated and summarized with no one the wiser, even though it includes some obviously impossible data points, e.g. dates that occurred before the study began, double entries, things of that nature.
Not exactly an original observation here, but when we talk about issues with stats/data analysis done by non-experts, this is often just as big an issue as (or a bigger issue than) whether they used one of those dumb flow diagrams to pick which analysis to do.
OK. The culmination of a year-plus, um, argument-like thing is finally here, and it's clearly going to get discussed on Twitter, so I'll post a thread on the affair for posterity & future links about my stance on this entire thing.
A long time ago, in a galaxy far away, before any of us had heard of COVID19, some surgeons (and, it must be noted for accuracy, a PhD quantitative person...) wrote some papers about the concept of post-hoc power.
I was perturbed, as were others. This went back and forth over multiple papers they wrote in two different journals, drawing quite a bit of Twitter discussion *and* a number of formal replies to both journals.
Inspired by this piece which resonated with me and many others, I'm going to run in a little different direction: the challenge of "continuing education" for early- and mid-career faculty in or adjacent to statistics (or basically any field that uses quantitative methods).
I got a Master's degree in Applied Statistics and then a PhD in Epidemiology. The truth is, there wasn't much strategy in the decision - just the opportunities that were there at the time - but Epi seemed like a cool *specific* application of statistics, so on I went
But then, as an early-career faculty member working more as a "statistician" than "epidemiologist," I've often given myself a hard time for not being a better statistician. I'm not good on theory. I have to think really hard sometimes about what should be pretty basic stuff.