Andrew Althouse (@ADAlthousePhD), 53 tweets, 9 min read

Some excellent work here as more people pry into the @Surgisphere papers. I'm going to try to build on this a bit further...

https://twitter.com/mikejohansenmd/status/1267675115908669441

Before we get started: many have pointed out some very legitimate reasons to be skeptical of how such a database could exist with so little record of the company's existence or infrastructure to support what would be an absolutely massive integration of EHR's around the world

Those are good points and people should continue to pursue them. I'm coming at this from another angle: I want definitive proof, or something like it, that these data cannot exist.

This is a cousin of, but not exactly, what some others are doing just by pointing out that the data seem inconsistent with other studies or what we "know" about coronavirus...

...which is also *a* way to attack this, but for a disease that's only existed for a few months with such scattershot data, I am hesitant to say that we *know* anything about it to that extent.

So I'm going to look for clues in the papers that are not just "this looks suspicious" but "this data literally cannot exist" to see what I can find.

Several people have picked up on this Figure S1 from the NEJM paper on ACE and ARB in COVID patients. There are a couple things that add up to make this suspicious.

Pedantic minor comment first: they refer to "age is depicted in deciles" in the footnote but the age groups are labeled in units of 10 years with the top being "81+ years" - that is not using deciles to divide the data. Either sloppy terminology or something else is afoot.

But let's assume the "decile" thing was just a sloppy label and they really meant age groups in 10-year increments with "81+" as the top group.

The reference group is "0-10 years" - which would be suspicious on its own, as @mikejohansenmd has pointed out, the share of young children hospitalized is a vanishingly small share of COVID cases.

This dataset is big-ish, but not *that* big (supposedly 8910 patients). And, the age distribution is supposedly mean 55.8 (SD 15.1) in the 515 nonsurvivors - which means there really have to be VERY few deaths in that age 0-10 group.

If there are very few deaths in the reference group, you'd expect the confidence intervals for each group compared to the reference group to be very wide (Mike has already shown some examples of this in other data)

The other thing that's at least making me a wee bit suspicious is the width of the confidence intervals. Here we have to go a bit into the statistical weeds.

Since confidence intervals for a logistic regression are computed on the log scale, the CI's are not "symmetric" about the point estimate unless you convert to the log scale.

ALSO: I want to emphasize that while I'm a medical statistician, I am not perfect, I may make mistakes or overlook something here, so if I goof up, happy for anyone to point out if I missed something or took a faulty step in logic.

But, I am assuming that Figure S1 comes from a logistic regression model (worth noting: the Results section never even mentions Figure S1 and the Methods is sufficiently vague that it's hard to be certain what they would be describing here)

So *if* the results of Figure S1 are a logistic regression model with "mortality" as a yes/no binary outcome, the age groups presented as labeled and compared to a reference group of age 0-10...

Let's actually use the width of the CI's to compute the standard error of each individual age group.

For example, age 81+ group, the point estimate is 5.076; 95% CI goes from 1.185-21.735.

Taking the natural log of these gives you a point estimate of 1.625 and CI bounds of 0.170 - 3.079.

Since a 95% CI is typically computed as +/- 1.96*SE from the point estimate, that implies the "SE" is about 0.74 for that CI: (1.625-0.170)/1.96 = 0.74

If you repeat that for age 71-80:

point estimate 4.928, CI 1.184-20.513

log-transformed point est 1.595, CI 0.169-3.021

estimated SE: (1.595-0.169)/1.96= about 0.73

Repeat this and the estimated SE for ALL of the age groups from 41-50 on up is between 0.72 and 0.74.
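The back-calculation above can be sketched in a few lines of Python. This is only an illustration under the same assumption made in the thread (a Wald-type 95% CI computed on the log scale); the function name `se_from_or_ci` is made up for this example, and it uses the full CI width divided by 2*1.96 rather than the half-width, which gives the same answer when the interval is symmetric on the log scale.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def se_from_or_ci(lower, upper):
    """Standard error of log(OR) implied by a 95% CI given on the OR scale."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)

# Point estimates and CI bounds quoted above from Figure S1
reported = {
    "81+": (5.076, 1.185, 21.735),
    "71-80": (4.928, 1.184, 20.513),
}

implied_se = {}
for group, (odds_ratio, lower, upper) in reported.items():
    implied_se[group] = se_from_or_ci(lower, upper)
    print(f"{group}: log(OR) = {math.log(odds_ratio):.3f}, "
          f"implied SE = {implied_se[group]:.2f}")
```

Only the two age groups whose numbers are quoted in the thread are included; the other rows of Figure S1 would have to be typed in by hand to check the 0.72 to 0.74 range.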

That feels...odd. I'll need to think about this some more; it's not proof yet but my gut reaction is that this feels surprising to see SE's that are so consistent across the age groups.

So now let's combine this with a few other bits from the paper to see if we can figure out how this data can exist. We don't know the overall age distribution, but the mean and SD for "nonsurvivors" and "survivors" are both reported, as well as the n (%) above age 65.

I'm trying to figure out how to piece together a dataset with the right age distribution for both "nonsurvivors" and "survivors" with the right n (%) above age 65 that could possibly create the OR's and CI's reported here. That's where I am at the moment.

I wish I was a better data thug like @sTeamTraen and @jamesheathers (I'm sure at some point they get tired of being tagged in every thread, but I must admit their work is what even gives me the idea to try and pick it apart this way).

But, still a work in progress. Will update once I have an idea. Using @sTeamTraen SPRITE tool to create age distributions with the right N, mean, SD, and then will have to play with them some to see what OR's and CI's would look like for the age distribution.

I will say, my first few SPRITE generated datasets make it hard to see how this age distribution can have anywhere near enough 0-10 / 11-20 for that graph to be real, but I haven't *proven* it yet, still in "this doesn't seem possible" territory.

OK, trying another angle of attack here to winnow down some of the possibilities.

Let's focus on the OR and CI for the 11-20 age group against the 0-10 age group. OR=0.415, 95% CI 0.068-2.521. What is the smallest number of deaths/cases in the groups that could be compatible with this OR and CI width?

This has sort of a GRIM feel to it, @sTeamTraen. When the number of events is very small, OR's can only take certain discrete values...

(frustratingly, the Figure S1 legend is vague and does not specify if there was a multivariable regression model used here, or if these are unadjusted OR's and CI's; if it was a regression, that could add some more noise to what I am about to say...)

but...the minimum number of deaths in each age group (0-10 and 11-20) that can possibly create that OR and CI in an unadjusted analysis looks to be 3 deaths in each group.

With only 2 deaths per group you can't come up with a CI as narrow as 0.068-2.521 (at least, no combination that I have tried with fewer than 3 deaths in each group can give a CI that narrow - if I've missed one, happy to walk this back!)
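For the case of 2 deaths in each group, trial and error isn't strictly needed. Assuming an unadjusted 2x2 analysis with the usual Wald interval (my assumption here, since the figure legend doesn't say), the standard error of log(OR) is sqrt(1/a + 1/b + 1/c + 1/d), where a and c are the deaths and b and d the survivors in the two groups. The survivor terms are always positive, so sqrt(1/a + 1/c) is a floor on the SE no matter how many patients there are. A rough sketch (the helper name `min_se` is made up for this example):

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def min_se(deaths_group, deaths_ref):
    """Floor on the SE of log(OR): the survivor terms 1/b + 1/d are always > 0."""
    return math.sqrt(1 / deaths_group + 1 / deaths_ref)

# SE implied by the reported CI of 0.068-2.521 for age 11-20 vs 0-10
required_se = (math.log(2.521) - math.log(0.068)) / (2 * Z95)

# Smallest SE reachable with 2 deaths in each group, however large the groups
floor_2_2 = min_se(2, 2)

print(f"SE implied by the reported CI: {required_se:.3f}")
print(f"SE floor with 2 deaths per group: {floor_2_2:.3f}")
print("2 deaths per group can match the CI width:", floor_2_2 < required_se)
```

If the figure actually came from an adjusted model, this bound doesn't apply directly.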

The smallest compatible numbers that I can come up with for that OR and CI are about 3 deaths/10 patients in the 0-10 year olds and 3 deaths/20 patients in the 11-20 year olds (OR=0.41, 95% CI 0.066-2.55)
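To make that calculation reproducible, here is a minimal sketch of an unadjusted odds ratio with a Wald 95% CI from raw counts. It assumes a plain 2x2 table with no covariate adjustment, which may not be what the paper did; the function name `or_wald_ci` is just for illustration.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def or_wald_ci(deaths, n, deaths_ref, n_ref):
    """Unadjusted odds ratio (group vs reference) with a Wald 95% CI."""
    a, b = deaths, n - deaths              # deaths / survivors in the group
    c, d = deaths_ref, n_ref - deaths_ref  # deaths / survivors in the reference
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - Z95 * se), math.exp(log_or + Z95 * se)

# 3 deaths / 20 patients aged 11-20 versus 3 deaths / 10 patients aged 0-10
odds_ratio, lower, upper = or_wald_ci(3, 20, 3, 10)
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.3f}-{upper:.2f}")
```

The same function can be pointed at any other pair of counts to see how the interval narrows as the denominators grow.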

But that can't be true, either. Because the overall mortality rate in the paper is 515/8910 = about 5.8% - and the mortality must be HIGHER in all of the higher age groups than the reference group (0-10).

OK. So we are still on the hunt for a number of deaths / patients in the 0-10 and 11-20 age groups that can make an OR=0.415, 95% CI 0.068-2.521...

...BUT ALSO needs to have a mortality lower than 5.8% for the 0-10 group (quite a bit lower, actually, I'd think) such that the mortality in higher age groups will be higher than the 0-10 group.

OK. Let's make it 3 deaths out of 73 patients in the 0-10 age group and 3 deaths out of 173 patients in the 11-20 age group. Now we have OR=0.41, 95% CI 0.08-2.09 for death in 11-20 versus 0-10. This CI is now narrower than the CI reported in Figure S1.

And it still (probably?) has a mortality rate that's too high for 0-10 to serve as the reference group.

UPDATE: I realized that, silly me, I might be able to get to that OR and CI with fewer than 3 deaths in one group. Working on this now.

Relevant: nejm.org/doi/full/10.10…

OK. Back on this. So, I realized that one could get very close to an OR of 0.415 with 95% CI of 0.068-2.51 if there are 2 deaths of 243 patients in age 11-20 versus 3 deaths of 153 patients in age 0-10.

I'm going to painstakingly go up the age scale to see if one can get the right OR and CI widths in each age group vs. a group with 3 deaths in 153 patients from age 0-10.

DISCLAIMER: I am not saying this is guaranteed to be right just yet, I am just trying to reasonably re-create a dataset that would match the properties of the OR's and CI's in Figure S1 along with the reported age characteristics of the dataset.

OK. So I am fairly confident that the OR's and CI's for the upper age groups cannot exist if the 0-10 group has 3 deaths in 153 patients and the 11-20 group has 2 deaths in 243 patients (again, I am starting with those...

...because I need a combination that gives the correct OR and CI for the 11-20 group against the 0-10 group while also having a low enough mortality rate that the majority of the upper age groups have higher mortality than the 0-10 group)

But, frankly, for the CI's to be as *wide* as they are when compared against groups of that size, the upper age groups have to be far too small for the OR and CI to match versus a reference group with 3/153.
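One way to put a rough number on "far too small": if the reference group is fixed at 3/153 and the analysis is an unadjusted 2x2 comparison with a Wald CI (both assumptions, not something the paper states), then the reported OR and CI width for an upper age group pin down how many deaths and patients that group could have had. The sketch below solves for that using the 81+ numbers quoted earlier in the thread; `implied_group_size` is a made-up helper name.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def implied_group_size(odds_ratio, lower, upper, ref_deaths, ref_n):
    """Deaths and total patients implied by an OR, its 95% CI and a fixed reference group."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z95)
    ref_survivors = ref_n - ref_deaths
    # Variance left over for the comparison group once the reference
    # group's contribution (1/c + 1/d) is subtracted
    remaining = se ** 2 - (1 / ref_deaths + 1 / ref_survivors)
    # Odds of death in the comparison group = OR * reference odds
    odds = odds_ratio * ref_deaths / ref_survivors
    # With survivors = deaths / odds, 1/deaths + 1/survivors = (1 + odds) / deaths
    deaths = (1 + odds) / remaining
    survivors = deaths / odds
    return deaths, deaths + survivors

# Age 81+ (OR 5.076, 95% CI 1.185-21.735) against a reference group of 3/153
deaths_81, n_81 = implied_group_size(5.076, 1.185, 21.735, 3, 153)
print(f"implied 81+ group: about {deaths_81:.1f} deaths in {n_81:.0f} patients")
```

The result is not an integer because the reported figures are rounded and the assumed reference group may be wrong; it is only meant to show the order of magnitude.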

Which seems to suggest that the number of events and cases in the age 0-10 group has to be smaller. So now I need a number of events/cases in the 0-10 age group that a) has lower within-group mortality than 5% and b) can generate an OR/CI of 0.415 (CI 0.068-2.512) for the 11-20 group.

This is where the sleuthing gets fun...there are only so many possible combinations left at this point which can give the right OR and CI width for the 11-20 group against the 0-10 group.
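The hunt for compatible combinations can also be done by brute force. The sketch below enumerates small death counts and group sizes and keeps every pair that reproduces the reported OR of 0.415 and CI of 0.068-2.521 within a tolerance. The search limits (up to 4 deaths and 250 patients per group) and the tolerances are arbitrary choices of mine to allow for rounding, and the whole thing assumes an unadjusted Wald interval.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval
TARGET_OR, TARGET_LO, TARGET_HI = 0.415, 0.068, 2.521  # age 11-20 vs 0-10

def wald_ci(a, b, c, d):
    """Wald 95% CI for the OR (a*d)/(b*c) of a 2x2 table."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - Z95 * se), math.exp(log_or + Z95 * se)

hits = []  # tuples of (ref_deaths, ref_n, grp_deaths, grp_n)
for ref_deaths in range(1, 5):
    for grp_deaths in range(1, 5):
        for ref_n in range(ref_deaths + 1, 251):
            for grp_n in range(grp_deaths + 1, 251):
                a, b = grp_deaths, grp_n - grp_deaths
                c, d = ref_deaths, ref_n - ref_deaths
                # Cheap check on the point estimate before computing the CI
                if abs((a * d) / (b * c) - TARGET_OR) >= 0.005:
                    continue
                lo, hi = wald_ci(a, b, c, d)
                if abs(lo - TARGET_LO) < 0.002 and abs(hi - TARGET_HI) < 0.02:
                    hits.append((ref_deaths, ref_n, grp_deaths, grp_n))

print(f"{len(hits)} compatible combinations found")
for ref_deaths, ref_n, grp_deaths, grp_n in hits[:10]:
    print(f"0-10: {ref_deaths}/{ref_n} ({ref_deaths / ref_n:.1%} mortality), "
          f"11-20: {grp_deaths}/{grp_n}")
```

Printing the reference-group mortality next to each hit makes it easy to discard combinations where the 0-10 group would have a higher death rate than the overall 5.8%.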

if there were 4 deaths/24 patients in the 0-10 group and 2 deaths/26 patients in the 11-20 group, that gets you pretty close to the correct OR and CI.

I've tried smaller numbers (e.g. only 1 death in one of the groups) but then it's really hard to get the CI to match without the other group being a bit cartoonish.
