Fun thread using some simulations modeled on the ARREST trial design (presented @CritCareReviews a few months ago) to talk through some potential features you might see when we talk about “adaptive” trials

DISCLAIMER: this is not just a “frequentist” versus “Bayesian” thread. Yes, this trial used a Bayesian statistical approach, but there are frequentist options for interim analyses & adaptive features, and that’s a longer debate for another day.
DISCLAIMER 2: this is just a taste using one motivational example for discussion; please don’t draw sweeping generalizations about “what adaptive trials do” from this thread, as the utility of each “feature” must always be considered carefully in its specific context
Anyways, this trial is pretty neat to walk through because it’s not too dreadfully hard to simulate and the design offers a chance to gently explore “how it works” for folks who are curious about adaptive trials.

ARREST (pubmed.ncbi.nlm.nih.gov/33197396/) was designed as follows:
Enroll the first 30 patients using 1:1 randomization. Perform interim analysis based on outcomes of those 30 patients.
If probability of superiority for ECMO > 0.986 after 30 patients, stop trial for efficacy (since this would suggest ECMO is so highly effective that it’s arguably unethical to continue randomizing patients not to receive it).
If probability of superiority of standard care > 0.986, stop the trial for harm (since this would suggest disastrous results with ECMO).
If neither of the above conditions are met, trial would continue and plan to enroll another 30 patients.
If trial continues past first interim, randomization for the next 30 participants was to be weighted in proportion to the posterior probability of the superior treatment at the most recent analysis, though restricted not to exceed 3:1 in either direction.
(I’ll explore this specific feature in some detail later in the thread, as response-adaptive randomization (RAR) comes with a lot of tricky pros and cons to discuss…)
The next interim analysis would occur at 60 patients, and every 30 patients thereafter (n=90, n=120, n=150) up to a maximum of 150 patients.
So I think I’d say the two main “adaptive” features to be aware of are i) the flexible sample size and ii) the response-adaptive randomization probabilities that would change at each interim analysis.
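The engine behind all of those decision rules is a posterior probability of superiority. Here is a minimal sketch of how you could compute it, assuming independent flat Beta(1,1) priors on each arm’s survival probability and a simple Monte Carlo draw (the actual ARREST analysis model may use different priors, so treat this as illustrative, not as the trial team’s code):

```python
import numpy as np

rng = np.random.default_rng(2021)

def prob_superiority(surv_ecmo, n_ecmo, surv_ctrl, n_ctrl, n_draws=100_000):
    """Posterior Pr(survival is higher with ECMO), flat Beta(1,1) priors on each arm."""
    p_ecmo = rng.beta(1 + surv_ecmo, 1 + n_ecmo - surv_ecmo, n_draws)
    p_ctrl = rng.beta(1 + surv_ctrl, 1 + n_ctrl - surv_ctrl, n_draws)
    return float(np.mean(p_ecmo > p_ctrl))

# Decision rules at each interim look, per the design described above:
#   prob_superiority(...) > 0.986      -> stop for efficacy
#   prob_superiority(...) < 1 - 0.986  -> stop for harm (standard care superior with prob > 0.986)
#   otherwise                          -> keep enrolling
```

Note that “probability of superiority of standard care > 0.986” is the same event as the ECMO superiority probability falling below 0.014, which is why a single number can drive both stopping rules.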
The trial team powered the study based on an assumed 12% survival probability in the standard care group, and concluded that the trial had 90% power to detect a benefit if the probability of survival was increased to 37% in the ECMO group.
They also established that the overall Type I error of this design was controlled at 5% if the true survival probability was 12% in both groups (i.e. no treatment benefit).
I’d like to walk through some simulated scenarios of the adaptive design versus a “conventional” design (e.g. parallel-group trial of n=150 with 1:1 randomization throughout) to discuss what you (can) gain and what you (can) give up by going with an adaptive design.
First: I’ll do 1000 simulations of the exact design of ARREST and report the simulated trial outcomes under the same assumptions the authors made for their primary power calculation (12% survival probability with control, increased to 37% probability of survival with ECMO).
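For what it’s worth, here is roughly how one of those simulated trials could be sketched. This is my own reconstruction under the assumptions above, not the trial team’s code: cohorts of 30, flat priors as in the earlier snippet, efficacy/harm thresholds at 0.986, and RAR after the first interim with the allocation probability clipped to the 3:1 cap. The `use_rar` flag and the per-patient (rather than blocked 1:1) randomization are my simplifications.

```python
def simulate_trial(p_ctrl=0.12, p_ecmo=0.37, max_n=150, cohort=30,
                   threshold=0.986, use_rar=True):
    """Simulate one ARREST-like adaptive trial; returns (decision, n_enrolled)."""
    surv_e = surv_c = n_e = n_c = 0
    alloc_ecmo = 0.5                                   # first cohort is (approximately) 1:1
    enrolled = 0
    while enrolled < max_n:
        # enroll the next cohort of 30 and observe survival in each arm
        new_e = rng.binomial(cohort, alloc_ecmo)
        new_c = cohort - new_e
        surv_e += rng.binomial(new_e, p_ecmo)
        surv_c += rng.binomial(new_c, p_ctrl)
        n_e += new_e
        n_c += new_c
        enrolled += cohort

        # interim analysis on all data accumulated so far
        p_sup = prob_superiority(surv_e, n_e, surv_c, n_c)
        if p_sup > threshold:
            return "efficacy", enrolled
        if p_sup < 1 - threshold:
            return "harm", enrolled

        # response-adaptive randomization for the next cohort, capped at 3:1
        if use_rar:
            alloc_ecmo = float(np.clip(p_sup, 0.25, 0.75))

    return "no conclusion", enrolled

results = [simulate_trial() for _ in range(1000)]
```

Each simulated trial returns what it concluded and how many patients it enrolled, which is all that’s needed to reproduce the interim-by-interim tallies below.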
First analysis (n=30 patients; N=1000 trials): 119 trials stopped for efficacy, 881 trials continued.

Second analysis (n=60 patients; N=881 trials remaining): 267 trials stopped for efficacy, 614 continued.
Third analysis (n=90; N=614 trials remaining): 269 trials stopped for efficacy, 345 continued.

Fourth analysis (n=120; N=345 trials remaining): 162 trials stopped for efficacy, 183 continued.
Final analysis (n=150; N=183 trials remaining): 76 trials concluded efficacy, 107 trials failed to meet the efficacy threshold of prob(superiority)>0.986.

OK, what did we learn from this set of simulations?
First: overall power of the design to detect this treatment effect (12% survival in standard; 37% survival in ECMO) is about 90%, since 893 of 1,000 simulated trials reached an efficacy conclusion (the estimate would stabilize further over a larger number of simulations)
Compare this to power for a conventional parallel-group design with n=150 patients randomized 1:1 and the same outcome (37% survival with ECMO versus 12% survival with standard therapy) of about 95%
The ARREST design gets you to about 90% power, so we did lose a little bit of statistical power versus the conventional design, but that assumes we recruit 150 patients without any provision for early stopping, which brings me to the second point…
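For the conventional comparator, one reasonable stand-in (my choice here, not necessarily what the authors used) is a fixed n=150, 1:1 trial analyzed once with a two-proportion z-test at two-sided alpha = 0.05:

```python
def conventional_trial(p_ctrl=0.12, p_ecmo=0.37, n_per_arm=75):
    """One fixed-sample 1:1 trial analyzed with a two-proportion z-test."""
    s_e = rng.binomial(n_per_arm, p_ecmo)
    s_c = rng.binomial(n_per_arm, p_ctrl)
    p_pool = (s_e + s_c) / (2 * n_per_arm)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    z = (s_e / n_per_arm - s_c / n_per_arm) / se
    return abs(z) > 1.96                               # "efficacy" at two-sided 5%

power_conventional = np.mean([conventional_trial() for _ in range(1000)])  # ~0.95
```

The exact test you pick (z-test, chi-square, Fisher, or a single Bayesian look at n=150) will move this estimate a little, but it lands around the 95% quoted above.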
The adaptive approach offers the opportunity to stop early, and in fact we see that this would be highly likely if the therapy is in fact highly effective (notice that 817 of the 1000 simulations terminated before n=150 because the efficacy threshold was crossed)
In fact, under the assumed treatment effects, there was a >50% probability that the trial would terminate by the n=90 interim analysis, meaning that (if the treatment really is highly effective) there’s a reasonably high probability the trial stops much sooner than it would under the conventional design
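Those proportions fall straight out of the `(decision, n_enrolled)` pairs stored by the simulation loop sketched earlier:

```python
decisions = np.array([d for d, _ in results])
stop_ns = np.array([n for _, n in results])

power_adaptive = np.mean(decisions == "efficacy")   # ~0.90 (the 893/1000 above)
prop_stopped_early = np.mean(stop_ns < 150)         # ~0.82 (the 817/1000 above)
prop_stopped_by_90 = np.mean(stop_ns <= 90)         # >0.5 under these assumed treatment effects
```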
Third thing: not only does the trial have the ability to terminate more quickly if an efficacy threshold is crossed, but the response-adaptive randomization means that (in theory…) more people inside the trial are getting treated with the better therapy.
This sounds like such an obvious thing that some people wonder why we don’t just do it for everything. How can you argue against using the study data to inform what people should get? If the study is showing one is better than the other, isn’t it good to use that one more?
There are tradeoffs, though: the first is that (all other things being equal) you lose a little bit of statistical power with RAR versus a straight 1:1 randomization (even if things go well early and the “right” treatment does well from the start).
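One way to see that first tradeoff with the sketch above is to rerun the same adaptive design with the allocation pinned at 1:1 (that’s what the `use_rar` flag was for) and compare the proportion of simulations concluding efficacy:

```python
res_rar = [simulate_trial(use_rar=True) for _ in range(2000)]
res_1to1 = [simulate_trial(use_rar=False) for _ in range(2000)]

power_rar = np.mean([d == "efficacy" for d, _ in res_rar])
power_1to1 = np.mean([d == "efficacy" for d, _ in res_1to1])
# All else being equal, expect power_1to1 to come out a bit higher than power_rar.
```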
The second is that a little bit of bad luck early can steer you in the wrong direction.

For example, in one of the simulated trials 10/15 died in the control group versus 12/15 died in the ECMO group in the first cohort of 30.
With a posterior probability of ECMO superiority of about 29%, randomization for the second cohort was weighted to a 29% chance of assignment to ECMO and a 71% chance of assignment to control. What happened from there in this simulation?
In the second group of 30 patients, 22 were assigned to standard therapy (21 died) and 8 were assigned to ECMO (4 died).
So by the second interim, the results were starting to look better for ECMO (16/23 deaths with ECMO versus 31/37 with standard therapy) but at the cost of a lot of patients being treated with the ‘wrong’ thing based on some early noise.
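The allocation arithmetic behind that tilt is just the posterior probability of superiority, clipped so the randomization never goes beyond 3:1 in either direction (the 0.95 below is a made-up number purely to show the cap binding):

```python
alloc_after_bad_start = float(np.clip(0.29, 0.25, 0.75))    # 0.29 -> 29% of the next cohort to ECMO
alloc_after_great_start = float(np.clip(0.95, 0.25, 0.75))  # capped at 0.75, i.e. the 3:1 limit
```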
Of course, this sort of sequence is fairly unlikely in any single trial, but it’s a risk you must be aware of with RAR: early fluke results can tilt things off course and take a while to recover from (or even result in the trial terminating early for futility).
In that simulation, the early results almost completely reversed later on: the trial concluded with a finding of efficacy at the n=120 interim analysis - but that early blip did result in a lot of people getting the “wrong” treatment in the second cohort of 30 patients.
PLEASE NOTE: I will not give a full treatise on RAR here. I’m happy to field some follow-up questions or send them on to true experts, but don’t want to spend too much longer on that specific point because the utility is variable depending on specific context.
Some considerations: how long should you wait before varying randomization? Should you cap the randomization probabilities so they can’t tilt too far? Is it more efficient in a multi-arm trial than in a 2-arm trial? Ethically, is the greater duty to patients inside the trial or those outside it?
Nonetheless, you can show that despite the potential downsides, in expectation RAR tends to result in a greater proportion of people in the trial receiving the better treatment.
OK. So this all sounds great. We have nearly the same statistical power in the ARREST design as we do in the conventional design but may be able to get there with fewer patients (and more of those randomized being treated with the better thing, if it actually works).
But it’s hard to get something without giving something up, and an oft-expressed concern about these more exotic trial designs is that they somehow lower the bar and make it easier to conclude ‘success’ for things that don’t work.
While (some!) Bayesians argue that Type I error is not an interesting or necessary quantity, most regulatory authorities reviewing Bayesian designs still require a demonstration that the design preserves overall Type I error control.
Anyways: it turns out that 29 of my 1,000 simulations under the null scenario (12% survival in both groups) concluded efficacy, so the Type I error (about 2.9%) hangs out in the right general ballpark, below the 5% target.
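That null-scenario check is just the same simulation loop with both arms set to 12% survival:

```python
res_null = [simulate_trial(p_ctrl=0.12, p_ecmo=0.12) for _ in range(1000)]
type_i_error = np.mean([d == "efficacy" for d, _ in res_null])  # ~0.03 in my run (29/1000 quoted above)
```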
For those interested…what actually happened in the ARREST trial?
Well, none of the fancy adaptive randomization actually happened, because the trial terminated at the first interim with N=30: 6 survivors in the ECMO group versus just 1 in the standard group gave a posterior probability of superiority of 0.9861, which crossed the 0.986 threshold to stop the trial early.
DISCLAIMER: obviously I am not qualified to give the clinician’s perspective on the trial as a whole. There was some excellent discussion @CritCareReviews
(e.g. is this just reflective of great outcomes at one highly expert center? Generalizable to other settings? Can we possibly trust a trial that randomized only 30 patients? And so on…)
My intent in this thread was simply to explain some of the adaptive features employed in the proposed design, the different ways this trial can unfold, and the pros & cons of using such a design versus a more conventional parallel-group design.
