In science, people tend to be most interested in positive results — a manipulation changes what you are measuring, two groups differ in meaningful ways, a drug treatment works, that sort of thing.
Journals preferentially publish positive results that are statistically significant — results that would be unlikely to have arisen by chance if there weren't something going on.
Negative results, meanwhile, are uncommon.
Knowing that journals are unlikely to publish negative results, scientists don't bother to write them up and submit them. Instead, the results end up buried in file drawers — or these days, file systems.
These are something called "z values", taken from over a million biomedical research papers.
What a weird distribution. Let's look a bit closer.
Without going into a lot of detail, we can view these scores as a measure of statistical support for a positive result. Values near zero indicate little or no support; values greater than 2 or so indicate statistical significance according to conventional thresholds (p<0.05).
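For readers who want the conversion spelled out, here is a minimal sketch of how a z value maps onto the conventional p<0.05 threshold, assuming a standard-normal reference distribution (the usual two-sided test):

```python
import math

def z_to_p(z):
    """Two-sided p value for a z score, assuming a standard-normal
    reference distribution: p = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))

print(round(z_to_p(1.96), 3))  # ~0.05, the conventional significance cutoff
print(round(z_to_p(0.5), 2))   # well above 0.05: little or no support
```

This is why values a bit above 2 correspond to "statistically significant" in the figure.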
We can reasonably conclude from this that there are a lot of studies sitting in file drawers. If everything were published, positive or negative, you might expect to see something rather like this. The shaded area represents the missing studies.
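A toy simulation makes the point concrete. The model below is a sketch under made-up assumptions (heterogeneous true effects, and a hypothetical 5% chance that a non-significant result gets published); it is not fit to the actual data, but it reproduces the qualitative "notch" around zero:

```python
import random

random.seed(0)

def simulate_z_values(n_studies=100_000, publish_nonsig_prob=0.05):
    """Toy model of the file-drawer effect on published z values.

    Each study has a true effect drawn from a normal distribution;
    the observed z is that effect plus unit sampling noise.
    Non-significant results (|z| < 1.96) are published only with
    probability `publish_nonsig_prob` (an invented parameter).
    """
    published = []
    for _ in range(n_studies):
        true_effect = random.gauss(0, 2)      # heterogeneous true effects
        z = true_effect + random.gauss(0, 1)  # sampling noise
        if abs(z) >= 1.96 or random.random() < publish_nonsig_prob:
            published.append(z)
    return published

zs = simulate_z_values()
near_zero = sum(1 for z in zs if abs(z) < 1.96)
print(f"{len(zs)} published; {near_zero} non-significant "
      f"({near_zero / len(zs):.1%} of the published record)")
```

Histogram the returned values and you get steep walls on either side of a hollowed-out middle, much like the figure.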
So what?
A bunch of boring stuff that didn't work didn't get published.
Who cares?
The problem is, these missing results can bias our view of what works and what doesn't.
If, in reading the literature, we only see the successes and not the failures, we may be drawn to incorrect conclusions about important scientific questions.
In one of my favorite studies, @eturnermd1 and colleagues looked at this phenomenon for studies of antidepressant efficacy.
Before I go on, I believe that antidepressants can work well for severe depression. Not all the time, and not without some tinkering. But they save lives.
Turner and his colleagues looked at what you would see if you went to the medical literature to look at clinical trials of antidepressants.
Studies above the line represent studies that found statistically significant benefits. Below the line, no benefits.
Looks great, right?
But the problem is, you're missing the studies that showed no result.
Erick was able to get access to these studies through the FDA regulatory process.
Adding those in, you get a really different picture.
I liken this to an iceberg. You normally see only the part above the waterline, but it can be a deadly mistake to assume there's nothing beneath.
What happened to all those missing trials below the waterline? Many of them — the ones shown in yellow below — simply didn't result in publications. They ended up in file drawers, so to speak.
What is perhaps more remarkable is what happened to other trials below the waterline. By "outcome shifting" — changing the success criteria that one is looking for after the results come in — the studies shown in blue were reframed as positive results and published.
None of this is to say that science is broken, corrupt, or anything like that. There are legitimate reasons not to fill the journals with reports of things that didn't work.
People are thinking hard about where we can be misled by these missing results—and what we can do about it.
I've done some work in this area myself, showing how we can end up believing false things (on a small scale; I'm not talking climate change or vaccine safety here) if we don't publish enough of our negative findings.
In my view, this is an important area in what we call the "Science of Science", "Metascience", or "Metaresearch."
When the pandemic relaxes its grip on my research attention, I look forward to returning to this area.
Addendum: A bit more technical, but important note about the z value figure. @Lakens points out that these data are mined from the literature. People may be reporting, but not quantifying, the negative results. That's true, at least in part.
But note that these z values are computed from confidence intervals (and other related data), not reported directly in the form of p or z values.
So I wouldn't expect the same selection bias in terms of what is reported quantitatively within a paper.
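To make the previous point concrete: if a paper reports an estimate with a symmetric 95% confidence interval, a z value can be recovered even when no p or z value is quoted. A minimal sketch, assuming a normal-approximation interval (the function name and numbers are illustrative, not from the underlying study):

```python
def z_from_ci(estimate, lower, upper, level_z=1.96):
    """Recover an approximate z value from a reported 95% CI.

    Assumes a symmetric, normal-approximation interval:
    SE = (upper - lower) / (2 * 1.96), then z = estimate / SE.
    """
    se = (upper - lower) / (2 * level_z)
    return estimate / se

# A reported difference of 0.50 with 95% CI [0.10, 0.90]:
print(round(z_from_ci(0.50, 0.10, 0.90), 2))  # 2.45
```

Because the interval is reported whether or not the result is significant, this route sidesteps selective quantitative reporting of the test statistic itself.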
There's also the issue of what hypotheses get tested in the first place. Chances are, most people don't spend their time testing things they expect won't work. So if researchers have good intuition about what the results will be, we don't expect many tests of hypotheses around z=0.
Maybe if everything were published, we would see something more like this.
Maybe. I'm not convinced, but it's not a trivial issue, I think, and I need to give it more careful thought.
[updated to be asymmetrical, per several useful comments]
For me the real take-home is that the "walls of the volcano" are extremely steep, even though the data were inferred from confidence intervals rather than collected from directly reported z scores or p values.
I wouldn't fit this to a Gaussian and use that to estimate the exact magnitude of publication bias. But as a general illustration of the principles underlying publication bias in science, I think it's powerful.
• • •
One of our key pieces of advice is to be careful of confirmation bias.
There's a thread going around about how the crop below is what happens when Twitter's use of eye-tracking technology to crop images is fed with data from a misogynistic society. I almost retweeted it. But…
…that story fits my pre-existing commitments about how machine learning picks up on the worst of societal biases. So I thought it was worth checking out.
While it would be fish in a barrel to drag this paper as a contribution to the pseudoscience of homeopathy, we'll largely pass on that here. More interestingly, this single paper illustrates quite a few of the points that we make in our forthcoming book.
The first of them pertains to the role of peer review as guarantor of scientific accuracy.
In our book we suggest that one never assume malice when incompetence is a sufficient explanation, and one never assume incompetence when an understandable mistake could be the cause.
Can we apply that here?
I bet we can.
A lot of cartographic software will choose bins automatically based on the range of the data. For example, these might be the 0-20%, 20-40%, 40-60%, 60-80%, and 80-100% bins.
As the upper bound changes over time, the scale slides much as we see here.
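A quick sketch of that default behavior, under the assumption that the software splits the observed range into equal-width bins (a simplified stand-in for what real mapping packages do):

```python
def equal_interval_bins(values, n_bins=5):
    """Split the observed data range into n equal-width bins,
    as many mapping packages do by default (simplified sketch)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_bins)]

# The same metric mapped in two different years; only the maximum
# observed value has grown, yet every bin boundary shifts with it.
print(equal_interval_bins([0, 100]))
print(equal_interval_bins([0, 250]))
```

Run it and the second call returns bins stretched to the new maximum, so an unchanged region can land in a different color class purely because the extremes moved.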
We've written several times about what we describe as Phrenology 2.0 — the attempt to rehabilitate long-discredited pseudoscientific ideas linking physiognomy to moral character — using the trappings of machine learning and artificial intelligence.
For example, we've put together case studies on a paper about criminal detection from facial photographs...
From the book: "Mathiness refers to formulas and expressions that may look and feel like math—even as they disregard the logical coherence and formal rigor of actual mathematics."
(Admittedly the shock-and-awe factor is minimal here in this sum of two quantities)
"When an equation exists only for the sake of mathiness, dimensional analysis often makes no sense."
If Boris claimed the threat level was a *function* of these two quantities, fine. But to say it is a *sum* makes zero sense.
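Dimensional analysis makes the objection mechanical: you can only add quantities that share units. A minimal sketch of that check (the `Quantity` class and the two example inputs are invented placeholders, not the actual terms in the formula being criticized):

```python
class Quantity:
    """Minimal dimensional-analysis sketch: a value tagged with units."""

    def __init__(self, value, units):
        self.value = value
        self.units = units  # e.g. {"person": 1, "day": -1}

    def __add__(self, other):
        # Addition is only defined when the dimensions match.
        if self.units != other.units:
            raise TypeError(f"cannot add {self.units} and {other.units}")
        return Quantity(self.value + other.value, self.units)

rate = Quantity(3.0, {"person": 1, "day": -1})  # hypothetical: contacts/day
score = Quantity(0.1, {})                       # hypothetical: dimensionless
try:
    threat = rate + score  # the 'mathiness' move: summing unlike quantities
except TypeError as e:
    print(e)
```

An arbitrary *function* of the two inputs carries no such constraint, which is exactly the distinction the thread is drawing.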