In science, people tend to be most interested in positive results — a manipulation changes what you are measuring, two groups differ in meaningful ways, a drug treatment works, that sort of thing.
Journals preferentially publish positive results that are statistically significant — results that would be unlikely to have arisen by chance if there weren't something going on.
Negative results, meanwhile, are uncommon.
Knowing that journals are unlikely to publish negative results, scientists don't bother to write them up and submit them. Instead, the results end up buried in file drawers — or these days, file systems.
These are something called "z values", taken from over a million biomedical research papers.
What a weird distribution. Let's look a bit closer.
Without going into a lot of detail, we can view these scores as a measure of statistical support for a positive result. Values near zero indicate little or no support; values greater than 2 or so indicate statistical significance according to conventional thresholds (p<0.05).
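For readers who want the conversion spelled out, here is a minimal sketch of how a z value maps onto the conventional p<0.05 threshold, assuming a standard-normal reference distribution (the usual two-sided test):

```python
import math

def z_to_p(z):
    """Two-sided p value for a z score, assuming a standard-normal
    reference distribution: p = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))

print(round(z_to_p(1.96), 3))  # ~0.05, the conventional significance cutoff
print(round(z_to_p(0.5), 2))   # well above 0.05: little or no support
```

This is why values a bit above 2 correspond to "statistically significant" in the figure.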
We can reasonably conclude from this that there are a lot of studies sitting in file drawers. If everything were published, positive or negative, you might expect to see something rather like this. The shaded area represents the missing studies.
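A toy simulation makes the point concrete. The model below is a sketch under made-up assumptions (heterogeneous true effects, and a hypothetical 5% chance that a non-significant result gets published); it is not fit to the actual data, but it reproduces the qualitative "notch" around zero:

```python
import random

random.seed(0)

def simulate_z_values(n_studies=100_000, publish_nonsig_prob=0.05):
    """Toy model of the file-drawer effect on published z values.

    Each study has a true effect drawn from a normal distribution;
    the observed z is that effect plus unit sampling noise.
    Non-significant results (|z| < 1.96) are published only with
    probability `publish_nonsig_prob` (an invented parameter).
    """
    published = []
    for _ in range(n_studies):
        true_effect = random.gauss(0, 2)      # heterogeneous true effects
        z = true_effect + random.gauss(0, 1)  # sampling noise
        if abs(z) >= 1.96 or random.random() < publish_nonsig_prob:
            published.append(z)
    return published

zs = simulate_z_values()
near_zero = sum(1 for z in zs if abs(z) < 1.96)
print(f"{len(zs)} published; {near_zero} non-significant "
      f"({near_zero / len(zs):.1%} of the published record)")
```

Histogram the returned values and you get steep walls on either side of a hollowed-out middle, much like the figure.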
So what?
A bunch of boring stuff that didn't work didn't get published.
Who cares?
The problem is, these missing results can bias our view of what works and what doesn't.
If, in reading the literature, we only see the successes and not the failures, we may be drawn to incorrect conclusions about important scientific questions.
In one of my favorite studies, @eturnermd1 and colleagues looked at this phenomenon for studies of antidepressant efficacy.
Before I go on, I believe that antidepressants can work well for severe depression. Not all the time, and not without some tinkering. But they save lives.
Turner and his colleagues looked at what you would see if you went to the medical literature to look at clinical trials of antidepressants.
Studies above the line represent studies that found statistically significant benefits. Below the line, no benefits.
Looks great, right?
But the problem is, you're missing the studies that showed no result.
Erick was able to get access to these studies through the FDA regulatory process.
Adding those in, you get a really different picture.
I liken this to an iceberg. You normally see only the part above the waterline, but it can be a deadly mistake to assume there's nothing beneath.
What happened to all those missing trials below the waterline? Many of them — the ones shown in yellow below — simply didn't result in publications. They ended up in file drawers, so to speak.
What is perhaps more remarkable is what happened to other trials below the waterline. By "outcome shifting" — changing the success criteria that one is looking for after the results come in — the studies shown in blue were reframed as positive results and published.
None of this is to say that science is broken, corrupt, or anything like that. There are legitimate reasons not to fill the journals with reports of things that didn't work.
People are thinking hard about where we can be misled by these missing results—and what we can do about it.
I've done some work in this area myself, showing how we can end up believing false things (on a small scale; I'm not talking climate change or vaccine safety here) if we don't publish enough of our negative findings.
In my view, this is an important area in what we call the "Science of Science", "Metascience", or "Metaresearch."
When the pandemic relaxes its grip on my research attention, I look forward to returning to this area.
Addendum: A bit more technical, but important note about the z value figure. @Lakens points out that these data are mined from the literature. People may be reporting, but not quantifying, the negative results. That's true, at least in part.
But note that these z values are computed from confidence intervals (and other related data), not reported directly in the form of p or z values.
So I wouldn't expect the same selection bias in terms of what is reported quantitatively within a paper.
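To make the previous point concrete: if a paper reports an estimate with a symmetric 95% confidence interval, a z value can be recovered even when no p or z value is quoted. A minimal sketch, assuming a normal-approximation interval (the function name and numbers are illustrative, not from the underlying study):

```python
def z_from_ci(estimate, lower, upper, level_z=1.96):
    """Recover an approximate z value from a reported 95% CI.

    Assumes a symmetric, normal-approximation interval:
    SE = (upper - lower) / (2 * 1.96), then z = estimate / SE.
    """
    se = (upper - lower) / (2 * level_z)
    return estimate / se

# A reported difference of 0.50 with 95% CI [0.10, 0.90]:
print(round(z_from_ci(0.50, 0.10, 0.90), 2))  # 2.45
```

Because the interval is reported whether or not the result is significant, this route sidesteps selective quantitative reporting of the test statistic itself.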
There's also the issue of what hypotheses get tested in the first place. Chances are, most people don't spend their time testing things they expect won't work. So if researchers have good intuition about what the results will be, we don't expect many tests of hypotheses around z=0.
Maybe if everything were published, we would see something more like this.
Maybe. I'm not convinced, but it's not a trivial issue, I think, and I need to give it more careful thought.
[updated to be asymmetrical, per several useful comments]
For me the real take-home is that the "walls of the volcano" are extremely steep, even though the data were inferred from confidence intervals rather than collected from directly reported z scores or p values.
I wouldn't fit this to a Gaussian and use that to estimate the exact magnitude of publication bias. But as a general illustration of the principles underlying publication bias in science, I think it's powerful.
• • •
One of our key pieces of advice is to be careful of confirmation bias.
There's a thread going around about how the crop below is what happens when Twitter's use of eye-tracking technology to crop images is fed with data from a misogynistic society. I almost retweeted it. But…
…that story fits my pre-existing commitments about how machine learning picks up on the worst of societal biases. So I thought it was worth checking out.
While it would be fish in a barrel to drag this paper as a contribution to the pseudoscience of homeopathy, we'll largely pass on that here. More interestingly, this single paper illustrates quite a few of the points that we make in our forthcoming book.
The first of them pertains to the role of peer review as guarantor of scientific accuracy.
In our book we suggest that one never assume malice when incompetence is a sufficient explanation, and one never assume incompetence when an understandable mistake could be the cause.
Can we apply that here?
I bet we can.
A lot of cartographic software will choose bins automatically based on the range of the data. For example, these might be the 0-20%, 20-40%, 40-60%, 60-80%, and 80-100% bins.
As the upper bound changes over time, the scale slides much as we see here.
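A quick sketch of that default behavior, under the assumption that the software splits the observed range into equal-width bins (a simplified stand-in for what real mapping packages do):

```python
def equal_interval_bins(values, n_bins=5):
    """Split the observed data range into n equal-width bins,
    as many mapping packages do by default (simplified sketch)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_bins)]

# The same metric mapped in two different years; only the maximum
# observed value has grown, yet every bin boundary shifts with it.
print(equal_interval_bins([0, 100]))
print(equal_interval_bins([0, 250]))
```

Run it and the second call returns bins stretched to the new maximum, so an unchanged region can land in a different color class purely because the extremes moved.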
We've written several times about what we describe as Phrenology 2.0 — the attempt to rehabilitate long-discredited pseudoscientific ideas linking physiognomy to moral character — using the trappings of machine learning and artificial intelligence.
For example, we've put together case studies on a paper about criminal detection from facial photographs...
From the book: "Mathiness refers to formulas and expressions that may look and feel like math—even as they disregard the logical coherence and formal rigor of actual mathematics."
(Admittedly the shock-and-awe factor is minimal here in this sum of two quantities)
"When an equation exists only for the sake of mathiness, dimensional analysis often makes no sense."
If Boris claimed the threat level was a *function* of these two quantities, fine. But to say it is a *sum* makes zero sense.
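Dimensional analysis makes the objection mechanical: you can only add quantities that share units. A minimal sketch of that check (the `Quantity` class and the two example inputs are invented placeholders, not the actual terms in the formula being criticized):

```python
class Quantity:
    """Minimal dimensional-analysis sketch: a value tagged with units."""

    def __init__(self, value, units):
        self.value = value
        self.units = units  # e.g. {"person": 1, "day": -1}

    def __add__(self, other):
        # Addition is only defined when the dimensions match.
        if self.units != other.units:
            raise TypeError(f"cannot add {self.units} and {other.units}")
        return Quantity(self.value + other.value, self.units)

rate = Quantity(3.0, {"person": 1, "day": -1})  # hypothetical: contacts/day
score = Quantity(0.1, {})                       # hypothetical: dimensionless
try:
    threat = rate + score  # the 'mathiness' move: summing unlike quantities
except TypeError as e:
    print(e)
```

An arbitrary *function* of the two inputs carries no such constraint, which is exactly the distinction the thread is drawing.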