Having one of those mornings where you realize that it's sometimes a lot more work to be a good scientist/analyst than a bad one.
(Explanation coming...)
Processing some source data that could just be tabulated and summarized with no one the wiser, thereby letting through some obviously impossible data points, e.g. dates that occurred before the study began, double entries, things of that nature.
Not exactly an original observation here, but when we talk about issues with stats/data analysis done by non-experts, this is often just as big an issue as (or a bigger issue than) whether they used one of those dumb flow diagrams to pick which analysis to do.
It would be *so* easy to just blow right past the meticulous double-checking for duplicate entries and impossible dates and go straight to running summary stats and models. And I'm guessing that's often what happens. Almost no way that ever actually gets picked up later.
I'm not sure what to do about this other than tell people "do careful checks of the source data and cleaning and processing steps en route to creating your final analysis dataset." But please, if you analyze data, do this.
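Just to make that concrete, here's a minimal sketch of the kind of checks I mean, assuming pandas and hypothetical file/column names ("source_data.csv", "subject_id", "enrollment_date") and a made-up study window:

```python
# Minimal sketch of source-data checks: double entries and out-of-window dates.
# File name, column names, and study dates are illustrative assumptions.
import pandas as pd

STUDY_START = pd.Timestamp("2019-01-01")  # assumed study window, for illustration only
STUDY_END = pd.Timestamp("2019-12-31")

df = pd.read_csv("source_data.csv", parse_dates=["enrollment_date"])

# Double entries: the same subject recorded more than once
dupes = df[df.duplicated(subset=["subject_id"], keep=False)]

# Impossible dates: enrollment before the study began (or after it ended)
bad_dates = df[(df["enrollment_date"] < STUDY_START) | (df["enrollment_date"] > STUDY_END)]

print(f"{len(dupes)} duplicated rows, {len(bad_dates)} rows with out-of-window dates")
# Inspect and resolve every flagged row *before* building the final analysis dataset.
```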
As promised last week, here is a thread to explore and explain some beliefs about interim analyses and efficacy stopping in randomized controlled trials.
Brief explanation of motivation for this thread: many people learn (correctly) that randomized trials which stop early *for efficacy reasons* will tend to overestimate the magnitude of a treatment effect.
This sometimes gets mistakenly extended to believing that trials which stopped early for efficacy are more likely to be “false-positive” results, i.e. treatments that don’t actually work but just got lucky at an early interim analysis.
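A quick way to see the distinction is a little simulation: with a standard two-look O'Brien-Fleming-type boundary, the overall false-positive rate under the null stays at the design level, but the trials that do cross the early boundary report inflated effect estimates. A minimal sketch (the sample size, boundaries, and 0.25-SD effect are just illustrative assumptions):

```python
# Two-arm trial with one interim look at half the data, using O'Brien-Fleming-style
# boundaries (z > 2.797 at the interim, z > 1.977 at the end; roughly a two-sided 0.05
# design applied one-sided here). All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2020)

def run_trial(true_effect, n_per_arm=200, z_interim=2.797, z_final=1.977):
    half = n_per_arm // 2
    treat = rng.normal(true_effect, 1.0, n_per_arm)
    ctrl = rng.normal(0.0, 1.0, n_per_arm)

    # Interim analysis on the first half of each arm
    z1 = (treat[:half].mean() - ctrl[:half].mean()) / np.sqrt(2 / half)
    if z1 > z_interim:
        return "early", treat[:half].mean() - ctrl[:half].mean()

    # Final analysis on the full sample
    z2 = (treat.mean() - ctrl.mean()) / np.sqrt(2 / n_per_arm)
    return ("final" if z2 > z_final else "none"), treat.mean() - ctrl.mean()

def summarize(true_effect, n_sim=20000):
    results = [run_trial(true_effect) for _ in range(n_sim)]
    early_estimates = [est for stop, est in results if stop == "early"]
    positive = sum(stop != "none" for stop, _ in results)
    mean_early = np.mean(early_estimates) if early_estimates else float("nan")
    print(f"true effect {true_effect}: positive rate {positive / n_sim:.3f}, "
          f"early-stop rate {len(early_estimates) / n_sim:.3f}, "
          f"mean estimate given early stop {mean_early:.3f}")

summarize(0.0)   # null: overall positive rate stays near the design's one-sided alpha (~0.025)
summarize(0.25)  # real effect: trials that stop early report estimates well above 0.25
```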
OK. The culmination of a year-plus, um, argument-like thing is finally here, and it's clearly going to get discussed on Twitter, so I'll post a thread on the affair for posterity & future links about my stance on this entire thing.
A long time ago, in a galaxy far away, before any of us had heard of COVID19, some surgeons (and, it must be noted for accuracy, a PhD quantitative person...) wrote some papers about the concept of post-hoc power.
I was perturbed, as were others. This went back and forth over multiple papers they wrote in two different journals, drawing quite a bit of Twitter discussion *and* a number of formal replies to both journals.
Inspired by this piece which resonated with me and many others, I'm going to run in a little different direction: the challenge of "continuing education" for early- and mid-career faculty in or adjacent to statistics (or basically any field that uses quantitative methods).
I got a Master's degree in Applied Statistics and then a PhD in Epidemiology. The truth is, there wasn't much strategy in the decision - just the opportunities that were there at the time - but Epi seemed like a cool *specific* application of statistics, so on I went
But then, as an early-career faculty member working more as a "statistician" than "epidemiologist" - I've often given myself a hard time for not being a better statistician. I'm not good on theory. I have to think really hard sometimes about what should be pretty basic stuff.
As more stuff continues to break on the @NEJM and @TheLancet papers using the Surgisphere 'data', there's another possibility which has occurred to me that I want to play out.
I've been poring over these numbers for a few days and have not yet found a purely "statistical" smoking gun: a mean that cannot exist, a confidence interval that can't exist, etc.
Thus far most of the prevailing sentiment that this data isn't real seems to come from anecdotal beliefs: not very much evidence that the company exists, insider knowledge of how hard it is to connect EHR data, etc.
Before we get started: many have pointed out some very legitimate reasons to be skeptical of how such a database could exist with so little record of the company's existence or infrastructure to support what would be an absolutely massive integration of EHRs around the world.
Those are good points and people should continue to pursue them. I'm coming at this from another angle: I want definitive proof, or something like it, that these data cannot exist.
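For what it's worth, one concrete version of "a mean that cannot exist" is a GRIM-style check: for an integer-valued variable (counts, scale items), a reported mean is only arithmetically possible if some integer total divided by n rounds to it. A minimal sketch, with made-up numbers rather than anything from the papers:

```python
# GRIM-style feasibility check for a reported mean of integer-valued data.
# The reported means and sample sizes below are hypothetical placeholders.
import math

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """True if a mean of n integer-valued observations could round to the reported value."""
    target = round(reported_mean, decimals)
    half_step = 0.5 * 10 ** (-decimals)
    lo = math.floor(n * (target - half_step))
    hi = math.ceil(n * (target + half_step))
    return any(round(total / n, decimals) == target for total in range(lo, hi + 1))

print(grim_consistent(3.47, 17))  # True: 59/17 = 3.4706 rounds to 3.47
print(grim_consistent(3.45, 17))  # False: no integer total over n = 17 rounds to 3.45
```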
Lots of questions being raised about @Surgisphere data analyses in @NEJM and @TheLancet. Others have already done some good work on this...
...so I'm going to focus on something curious that I noticed when I decided to actually read these papers instead of just skimming (will totally admit that I had not been paying very close attention to this until yesterday).
@Surgisphere is supposedly integrating data from hundreds of hospitals around the world, across different continents, into what is supposedly The Very Biggest Data if you read most of their descriptions.