Just because it bothers me so much, a short thread on why a false discovery rate is not sufficient to make a binary decision.
A common motivation in model comparison is identifying which of two statistical models is more consistent with a given observation. Let's call one the "null hypothesis" and the other the "alternative hypothesis".
In certain completely arbitrary scientific fields the null hypothesis might be called the "background model" and the alternative hypothesis would be the "background plus signal model". Just an example...
The temptation is to just look at the consistency of the observed data with the null hypothesis. If the observed data is sufficiently rare then it's unlikely to have come from the null hypothesis, no? Maybe, but not necessarily.
Sometimes we just observe tail events. There are _lots_ of experiments out there being cut up into _lots_ of individual analyses. Tail events happen.
What we really need to do is _compare_ how rare an observation is relative to the null _and_ alternative hypotheses. If the observation is rare under both hypotheses then neither can be rejected or accepted as the more compatible with the observed data.
If we want to reject the null hypothesis in favor of the alternative hypothesis then we need to show that the alternative hypothesis is much more consistent with the observed data!
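To make that concrete, here's a minimal sketch with a hypothetical Poisson counting example (the background rate, observed count, and signal rate are all assumed for illustration): the observation is a tail event under the background-only model, but it's also a tail event under a background-plus-signal configuration.

```python
# Minimal sketch: tail probabilities of a hypothetical observed count under
# the background-only model and under one background-plus-signal configuration.
from scipy.stats import poisson

background_rate = 100.0  # assumed background expectation
signal_rate = 5.0        # one assumed signal configuration
observed_count = 130     # hypothetical observation

# P(N >= observed_count) under each hypothesis; sf(k, mu) = P(N > k).
p_null = poisson.sf(observed_count - 1, background_rate)
p_alt = poisson.sf(observed_count - 1, background_rate + signal_rate)

print(f"P(N >= {observed_count} | background)          = {p_null:.4f}")
print(f"P(N >= {observed_count} | background + signal) = {p_alt:.4f}")
# Both tail probabilities are small, so the observation is rare under *both*
# hypotheses; rarity under the null alone doesn't favor the alternative.
```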
Oh, but there's a problem. The alternative hypothesis typically comprises many different hypotheses. If the null hypothesis is the background-only model then the alternative is background plus _all possible configurations of the signal_.
When we compare how compatible the alternative hypothesis is with the data we need to consider all of those model configurations, including the ones where the signal is so vanishingly small that it wouldn't have any influence relative to the background.
In other words there will always be model configurations in the alternative hypothesis that are just as inconsistent with the observed data as the null hypothesis.
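With the same hypothetical Poisson setup, a quick sketch of that dilution: as the assumed signal rate shrinks toward zero, the alternative configuration's tail probability converges to the null's.

```python
# Minimal sketch: the nested alternative contains configurations that are
# just as inconsistent with the data as the null.
from scipy.stats import poisson

background_rate, observed_count = 100.0, 130  # same assumed numbers as above

for signal_rate in [20.0, 5.0, 1.0, 0.1, 0.0]:
    p = poisson.sf(observed_count - 1, background_rate + signal_rate)
    print(f"signal rate {signal_rate:5.1f}: P(N >= {observed_count}) = {p:.5f}")
# A signal rate of 0.0 reproduces the null exactly, and nearby configurations
# are indistinguishable from it for any finite observation.
```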
From a worst-case perspective, like frequentist significance testing, this means that we can never really distinguish between the two models for any finite observation. This is why null hypothesis significance testing secretly falls apart for nested null/alternative hypotheses.
To have any discriminating power one needs to refine the alternative hypothesis, either deterministically with hard cuts or probabilistically with Bayesian priors.
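A sketch of the probabilistic route, with an assumed half-normal prior on the signal rate: averaging the tail probability over that prior turns "the alternative" into a well-defined hypothesis that only weights signals considered plausible.

```python
# Minimal sketch: refine the alternative with a prior over the signal rate and
# average the tail probability over that prior (all numbers are assumptions).
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
background_rate, observed_count = 100.0, 130

# Assumed prior over plausible signal rates: half-normal with scale 20.
signal_draws = np.abs(rng.normal(0.0, 20.0, size=100_000))

p_alt_refined = poisson.sf(observed_count - 1,
                           background_rate + signal_draws).mean()
p_null = poisson.sf(observed_count - 1, background_rate)

print(f"null tail probability:                {p_null:.4f}")
print(f"refined alternative tail probability: {p_alt_refined:.4f}")
# With this assumed prior the observation is no longer a tail event for the
# refined alternative, so the two hypotheses can actually be discriminated.
```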
When people say "OMG N sigma" or "OMG p < 0.001" what they're really saying is that "this would be an extreme deviation for the null hypothesis but it wouldn't be anywhere near as extreme for the relevant model configurations in the alternative hypothesis".
The problem is that the implicit definition of the alternative hypothesis, and in particular of which of its model configurations count as "relevant", makes it impossible to understand the assumptions being made and hence to assess the validity of the statement.
Oh, and this is just the tip of the iceberg of "common mistakes in model comparison". Even after you better pose your competing hypotheses and try to consider the relationship of both to the observed data you have to deal with...
Empirical comparisons just determine which hypothesis is more consistent with the observed data _but not necessarily the truth_ <enter the overfitting chorus>. To understand that relationship you need to _calibrate_ your decision-making process.
This is why, for example, model selection using Bayes factors can be so incredibly bad. For more on calibrating model comparison processes see arxiv.org/abs/1803.08393.
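Here's what a simulation-based calibration might look like for the toy setup above, in the spirit of that reference (the signal prior, the Bayes-factor threshold, and the sample sizes are all assumptions): simulate data from each hypothesis and check how often a thresholded Bayes factor gets the answer wrong.

```python
# Minimal sketch: calibrate a Bayes-factor threshold by simulation, checking
# how often it flags null data (a false positive) or misses genuine signal.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
background_rate = 100.0
prior_signal = np.abs(rng.normal(0.0, 20.0, size=10_000))  # assumed signal prior

def bayes_factor(count):
    """Marginal likelihood ratio of the prior-averaged alternative to the null."""
    evidence_alt = poisson.pmf(count, background_rate + prior_signal).mean()
    evidence_null = poisson.pmf(count, background_rate)
    return evidence_alt / evidence_null

threshold = 10.0  # assumed "strong evidence" cutoff to be calibrated

# Simulate counts under the null and under the prior-refined alternative.
null_counts = rng.poisson(background_rate, size=1_000)
alt_counts = rng.poisson(background_rate + rng.choice(prior_signal, size=1_000))

false_positive_rate = np.mean([bayes_factor(n) > threshold for n in null_counts])
missed_signal_rate = np.mean([bayes_factor(n) <= threshold for n in alt_counts])

print(f"P(BF > {threshold} | null)        = {false_positive_rate:.3f}")
print(f"P(BF <= {threshold} | alternative) = {missed_signal_rate:.3f}")
# These calibrated error rates, not the Bayes factor by itself, describe how
# the decision process relates to the truth.
```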
The second big issue is computation. In practice neither the null nor the alternative hypothesis will be given by a single model configuration (systematic and measurement effects!), which makes accurately computing measures of extremity, like p-values and Bayes factors, very hard.
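As a small taste of the computational issue, a sketch with a single assumed nuisance parameter: an uncertain background expectation that has to be averaged over by Monte Carlo just to get the null tail probability.

```python
# Minimal sketch: even the null tail probability has to be averaged over an
# assumed uncertainty on the background rate (a single nuisance parameter).
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
observed_count = 130

# Assumed uncertainty on the background expectation, e.g. from a control region.
background_draws = rng.normal(100.0, 5.0, size=100_000)

p_null_marginal = poisson.sf(observed_count - 1, background_draws).mean()
p_null_fixed = poisson.sf(observed_count - 1, 100.0)

print(f"tail probability, fixed background:     {p_null_fixed:.4f}")
print(f"tail probability, uncertain background: {p_null_marginal:.4f}")
# Realistic models have many such systematic and measurement effects, which is
# what makes these tail probabilities (and Bayes factors) so hard to compute.
```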
Anyways, model comparison is really hard in practice. Fortunately it's not really needed when all we actually want to know is how much signal is consistent with the observed data, so we can individually decide whether that amount is relevant or not.
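A sketch of that workflow with the same hypothetical counting setup (the flat prior and grid posterior are assumptions for illustration): infer how much signal is consistent with the observed count and report an interval, leaving the question of relevance to the domain.

```python
# Minimal sketch: instead of a binary comparison, infer the signal rate itself
# with a simple grid posterior (flat prior over the grid is an assumption).
import numpy as np
from scipy.stats import poisson

background_rate, observed_count = 100.0, 130

signal_grid = np.linspace(0.0, 100.0, 2001)
likelihood = poisson.pmf(observed_count, background_rate + signal_grid)
posterior = likelihood / likelihood.sum()  # flat prior over the grid

cdf = np.cumsum(posterior)
lower = signal_grid[np.searchsorted(cdf, 0.05)]
upper = signal_grid[np.searchsorted(cdf, 0.95)]

print(f"90% credible interval for the signal rate: [{lower:.1f}, {upper:.1f}]")
# Whether that much signal matters is then a separate, domain-specific decision
# rather than a universal "discovery" verdict.
```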