Andrew Althouse

Follow @ADAlthousePhD

Nov 1, 2018 • 55 tweets • 7 min read

https://twitter.com/ADAlthousePhD/status/1053355096267022336

TOPIC: P-Values In Table 1 of RCT's. Time to revisit this poll.

Thanks very much to the clinicians that responded. This came out better than expected, albeit the selection bias of “clinicians that follow statisticians on Twitter” suggests that the respondents are collectively better versed in data analysis than the general research population.

Anyways, putting p-values in Table 1 of RCT’s is an inappropriate use of significance testing, yet remains prevalent in medical literature, because it SEEMS to make so much sense (at least, the way most people have been taught p-values and statistical significance…)

There are two separate problems here:

(1): the mistaken belief that perfect “baseline balance” is necessary for a treatment comparison to be valid, and that any deviations from such balance (presumably, as shown by p<0.05’s in the baseline table) undermine the trial’s primary comparison.

(2): the mistaken belief that a p-value offers meaningful information for assessing the danger described in problem (1).

We’ll cover problem (2) first, then return to (1) later - it’s more complicated.

Anyway, what’s the deal with p-values comparing the randomized treatment arms?

Let’s explain using the two-sample t-test, which is meant to determine whether the observed data are consistent with an assumption (null hypothesis) that the “population mean” from which one sample was drawn is equal to the “population mean” from which the other was drawn.

Ex: if one wishes to test whether people with brown eyes tend to be taller than people with blue eyes, one might recruit a random sample of people with brown eyes and people with blue eyes, measure the heights in each, and perform a two-sample t-test comparing the sample means.

In this setting, a p-value represents the probability of observing a difference at least as large as the one in our sample data if the null hypothesis were true (in this case, “the height of brown-eyed people is equal to the height of blue-eyed people” in the population of interest).

Suppose that the p-value was 0.01, meaning (using common language…) there was only a 1% chance of observing a difference at least this large in our sample if the 2 populations of interest (brown-eyed people vs blue-eyed people) actually share a common distribution of height.

Since it was unlikely to observe this difference under H0, we “reject” the null hypothesis that brown-eyed people and blue-eyed people share a common distribution of height, concluding that…

…our sample provides evidence that the population distribution of height is different for brown-eyed people than it is for blue-eyed people.
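To make the eye-color example concrete, here is a minimal Python sketch (not part of the original thread). The heights are simulated and entirely hypothetical, `two_sample_p_value` is an illustrative helper rather than a library function, and it swaps the exact t distribution for a normal approximation, which is only reasonable for large samples like these.

```python
import math
import random
from statistics import NormalDist, mean, variance

def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in sample means.

    Uses the Welch-style standard error and a normal approximation
    to the t distribution (an assumption that suits large samples).
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(2018)

# Hypothetical heights in cm, drawn from two populations whose
# true means really do differ (172 vs 170).
brown = [rng.gauss(172.0, 8.0) for _ in range(500)]
blue = [rng.gauss(170.0, 8.0) for _ in range(500)]

p_height = two_sample_p_value(brown, blue)
print(f"Observed difference in means: {mean(brown) - mean(blue):.2f} cm")
print(f"Two-sided p-value: {p_height:.4f}")
```

Here the question being tested is a real one: we do not know in advance whether the two populations share a mean height, so a small p-value tells us something.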

Now, let’s talk about a common Table-1-of-an-RCT scenario: comparison of the mean age for participants assigned to Drug A versus Drug B. Performing a two-sample t-test comparing mean age in patients assigned to Drug A versus mean age in patients assigned to Drug B is testing…

…the probability of observing this data under the null hypothesis that the “population mean” age of participants in the group assigned to Drug A is equal to the “population mean” age of participants in the group assigned to Drug B.

However, the participants assigned to Drug A came from the *same population* as participants assigned to Drug B (patients meeting inclusion criteria that enrolled in the trial).

The patients assigned to each treatment arm share the *same* population distribution for all baseline variables because they are selected from a common population, then randomly split into 2 (or however many) groups.

As the late Doug Altman said: “performing a significance test to compare baseline variables [my addition: in an RCT] is to assess the probability of something having occurred by chance when we know that it did occur by chance.”

Furthermore, as Frank Harrell has pointed out, substituting randomly generated numbers for all baseline variables would show a handful of “baseline differences” between the populations, too. No one would suggest that we need to “adjust” for the randomly generated numbers.
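A quick simulation can illustrate Altman’s and Harrell’s point. This is a sketch under stated assumptions, not something taken from their papers: ages are drawn from one hypothetical enrolled population (mean 60, SD 10), split at random into two arms, and compared with an illustrative large-sample test (`two_sample_p_value`, using a normal approximation). Repeated over many simulated trials, roughly 1 in 20 should show a “significant” age difference purely by chance.

```python
import math
import random
from statistics import NormalDist, mean, variance

def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def baseline_p_value(rng, n_patients=200):
    """Simulate one trial and test baseline age between the two arms."""
    # Every patient's age comes from the same enrolled population...
    ages = [rng.gauss(60.0, 10.0) for _ in range(n_patients)]
    # ...and only the random allocation decides who lands in which arm.
    rng.shuffle(ages)
    half = n_patients // 2
    return two_sample_p_value(ages[:half], ages[half:])

rng = random.Random(1)
n_trials = 2000
p_values = [baseline_p_value(rng) for _ in range(n_trials)]

# The null hypothesis is true by construction in every simulated trial,
# so any "significant" result is a chance finding.
share_significant = sum(p < 0.05 for p in p_values) / n_trials
print(f"Share of trials with p < 0.05 for baseline age: {share_significant:.3f}")
```

The null hypothesis is true in every one of these trials because of how the arms were built, so the test can only ever tell us what we already knew.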

This was covered repeatedly in the statistical literature in the 1980’s and 1990’s:

Altman DG. Comparability of randomized groups. J Royal Stat Soc 1985; 34: 125-136.

Altman DG, Dore CJ. Randomisation and baseline comparison in clinical trials. Lancet 1990; 335: 149-153.

Senn SJ. Baseline comparisons in randomized clinical trials. Stat Med 1991; 10: 1157-1160.

Senn SJ. Testing for baseline balance in clinical trials. Stat Med 1994; 13: 1715-1726.

Begg CB. Significance tests of covariate imbalance in clinical trials. Controlled Clin Trials 1990; 11: 223-225.

CONSORT guidelines recommend against it: “significance testing of baseline differences in randomized controlled trials (RCTs) should not be performed, because it is superfluous and can mislead investigators and their readers”

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, et al. CONSORT 2010 Explanation and Elaboration: Updated guidelines for reporting parallel group randomised trials. J Clin Epidemiol 2010; 63: e1–37.

And yet, despite the marvelous papers by Altman, Begg, Senn, and others on the subject, many researchers (clinicians and statisticians both) still believe that p-values comparing the randomized treatment arms are the appropriate first step in analyzing / interpreting an RCT.

I’ve seen reviewers and editors ask for it. A team of researchers (de Boer et al, cited below) described their experience thusly:

“…in our submitted papers we followed the CONSORT statement in not testing for baseline differences. However, after submission of the papers we were again faced with comments that tests of baseline differences should be added, but now from reviewers or even editors.”

“To our surprise and dismay, these reviewers insisted on this point even after we had provided a logical explanation why we preferred not to present these p-values. Eventually, we decided to add the tests and as a result they are included in all four of our publications.”

Reference: de Boer MR, Waterlander WE, Kuijper LDJ, Steenhuis IHM, Twisk JWR. Testing for baseline differences in randomized controlled trials: an unhealthy research behavior that is hard to eradicate. Int J Behav Nutr Phys Act 2015; 12: 4.

So, yeah, this is definitely still a problem in the medical literature.

The #medtwitter crowd that frequently engages with Harrell, Senn, et al probably has started to come around on this. My little survey is likely biased since my sphere of Twitter influence/conversation includes mostly MD’s that engage with statisticians.

Strong suspicion (unproven, of course) that this survey carried out in many academic-medicine departments would have >80% of people answering that the p-values in Table 1 are necessary because you need to compare the randomized treatment arms and/or check for balance.

But please, share this information (the Twitter thread for the lay version; the Senn, Altman, de Boer papers for more technical and professional explanation) with your friends and colleagues.

Together, maybe, one day, people will stop asking about p-values in Table 1 of RCT’s.

“Why didn’t you write a paper about this?” – I’m working on it. But this isn’t really new content. Statisticians published a bunch of these papers in the 1980’s and 1990’s. The trick is getting people to read them, and then to *change their thinking* as a result.

With respect to Senn, Altman, and others (all fine writers) it seems that papers published in statistics journals tend to be ignored or trivialized by the clinical research world. Understandably, to a degree, there are a hundred medical journals they’re trying to keep up with.

It’s time to try communicating on multiple fronts, both traditional and nontraditional.

Oh, as for this: “(1) the mistaken belief that perfect “baseline balance” is necessary for a treatment comparison to be valid, and that any deviations from such balance (presumably, as shown by p<0.05’s in the baseline table) undermine the trial’s primary comparison.”

We’ll come back to that another time. A few brief thoughts:

Paraphrasing Harrell: Statistical inference is based on probability distributions. It is sufficient to know that the tendency was for baseline covariates to be balanced, because it is the tendency on which assumptions of the statistical tests are based.

Paraphrasing Senn: The probability calculation applied to a clinical trial automatically makes an allowance for the fact that groups will almost certainly be unbalanced, and if one knew that they were balanced, then the calculation that is usually performed would not be correct.

And finally, the focus on “baseline balance” has impeded discussion of a more productive step in the analysis of RCT’s that would alleviate some of these concerns: pre-specifying baseline covariates that should be adjusted for in the final treatment comparison.
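For a rough sense of why pre-specified adjustment helps, here is a hedged simulation sketch. Everything in it is an assumption made for illustration: a continuous outcome, a single strongly prognostic baseline covariate, a linear relationship, a true treatment effect of 1.0, and made-up effect sizes. It compares the plain difference in means with an ANCOVA-style estimate that subtracts the pooled within-arm slope times the chance difference in the covariate.

```python
import random
from statistics import mean, stdev

def simulate_trial(rng, n_per_arm=100, effect=1.0):
    """Return (unadjusted, adjusted) treatment effect estimates for one trial."""
    arms = {}
    for treated in (0, 1):
        # Baseline covariate: same population distribution in both arms.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        # Outcome depends strongly on the covariate plus the treatment.
        y = [2.0 * xi + effect * treated + rng.gauss(0.0, 1.0) for xi in x]
        arms[treated] = (x, y)
    (x0, y0), (x1, y1) = arms[0], arms[1]

    unadjusted = mean(y1) - mean(y0)

    # ANCOVA slope: pooled within-arm cross-products over pooled
    # within-arm sums of squares of the covariate.
    sxy = sxx = 0.0
    for x, y in (arms[0], arms[1]):
        mx, my = mean(x), mean(y)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx += sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx

    # Remove the part of the outcome difference explained by the
    # chance imbalance in the covariate.
    adjusted = unadjusted - slope * (mean(x1) - mean(x0))
    return unadjusted, adjusted

rng = random.Random(7)
results = [simulate_trial(rng) for _ in range(1000)]
unadjusted_estimates = [u for u, _ in results]
adjusted_estimates = [a for _, a in results]

mean_unadjusted = mean(unadjusted_estimates)
mean_adjusted = mean(adjusted_estimates)
sd_unadjusted = stdev(unadjusted_estimates)
sd_adjusted = stdev(adjusted_estimates)
print(f"Unadjusted: mean {mean_unadjusted:.3f}, SD {sd_unadjusted:.3f}")
print(f"Adjusted:   mean {mean_adjusted:.3f}, SD {sd_adjusted:.3f}")
```

In this setup both estimators center on the true effect, but the adjusted one should vary less from trial to trial, which is where the extra power comes from. No baseline p-value is consulted at any point; the covariate is chosen in advance because it is prognostic.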

Many, many, many papers in the clinical epi/statistics literature have discussed this:

Canner PL. Covariate adjustment of treatment effects in clinical trials. Controlled Clin Trials 1991; 12: 359-366.

Tukey JW. Tightening the Clinical Trial. Controlled Clin Trials 1993; 14: 266-285.

Neuhaus JM. Estimation Efficiency with Omitted Covariates in Generalized Linear Models. J Am Stat Assoc 1998; 93: 1124-1129.

Hauck WW, Anderson S, Marcus SM. Should We Adjust for Covariates in Nonlinear Regression Analyses of Randomized Trials? Controlled Clin Trials 1998

Steyerberg EW, Bossuyt PMM, Lee KL. Clinical trials in acute myocardial infarction: should we adjust for baseline characteristics? Am Heart J 2000; 139(5): 745-751.

Hernandez AV, Steyerberg EW, Habbema JDF. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. J Clin Epi 2004; 57(5): 454-460.

Hernandez AV, Eijkemans MJC, Steyerberg EW. Randomized controlled trials with time-to-event outcomes: How much does prespecified covariate adjustment increase power? Ann Epi 2006; 16(1): 41-48.

Gray LJ, Bath P, Collier T. Should stroke trials adjust functional outcome for baseline prognostic factors? Stroke 2009

Kent DM, Trikalinos TA, Hill MD. Are unadjusted analyses of clinical trials inappropriately biased toward the null? Stroke 2009

Lingsma H, Roozenbeek B, Steyerberg E. Covariate adjustment increases statistical power in randomized controlled trials. J Clin Epi 2010; 63(12): 1391.

Groenwold RHH, Moons KGN, Peelen LM, Knol MJ, Hoes AW. Reporting of treatment effects from randomized trials: A plea for multivariable risk ratios. Contemp Clin Trials 2011; 32(3): 399-402.

Ciolino JD, Martin RH, Zhao W, Jauch EC, Hill MD, Palesch YY. Covariate imbalance and adjustment for logistic regression analysis of clinical trial data. J Biopharm Stat 2013; 23(6): 1383-1402.

Thompson DD, Lingsma HF, Whiteley WN, Murray GD, Steyerberg EW. Covariate adjustment had similar benefits in small and large randomized controlled trials. J Clin Epi 2015; 68(9): 1068-1075.

Lee PH. Covariate adjustments in randomized controlled trials increased study power and reduced biasedness of effect size estimation. J Clin Epi 2016; 76(1): 137-146.

Jiang H, Kulkarni PM, Mallinckrodt CH, Shurzinske L, Molenberghs G, Lipkovich I. Covariate Adjustment for Logistic Regression Analysis of Binary Clinical Trial Data. Stat Biopharm Res 2017; 9(1): 126-134.
