What a wonderful opportunity to talk about experimental design and the subtleties of what overfitting means in a Bayesian context! An "I've been consulting all day and my brain is too tired to do real work" thread.
Overfitting and identifiability are intimately related concepts. When you have a complex model but only a small data set, there will be _many_ model configurations consistent with the little data that you observed.
If your inferences are quantified by a point estimate then you will have to choose a single point amongst the entire subset of model configurations that are similarly consistent with any given observation. Any choice of a single point, however, is unlikely to generalize well.
In order to generalize well one has to quantify _all_ of those nearly equivalent model configurations. How do we quantify a subspace of nearly equivalent model configurations? Probability distributions are a wonderful possibility, which is why Bayesian inference is so powerful!
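To make the degeneracy concrete, here's a minimal numpy sketch (my own hypothetical example, not from the thread) of a non-identified model, y ~ Normal(a + b, 1), where the data constrain only the sum a + b: an entire ridge of (a, b) configurations is nearly equivalent, and no single point estimate stands out.

```python
import numpy as np

# Hypothetical non-identified model: y ~ Normal(a + b, 1), so the data only
# constrain the sum a + b and a ridge of (a, b) pairs fits equally well.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=3)  # small data set, true a + b = 1

def log_lik(a, b):
    return -0.5 * np.sum((y - (a + b)) ** 2)

grid = np.linspace(-3.0, 3.0, 201)
ll = np.array([[log_lik(a, b) for a in grid] for b in grid])

# Many grid points sit within a tiny tolerance of the maximum log likelihood:
# a ridge of nearly equivalent configurations rather than a single best point.
print("near-optimal (a, b) pairs:", int((ll > ll.max() - 0.01).sum()))
```

Picking any one of those near-optimal pairs as a point estimate is arbitrary; quantifying the whole ridge with a posterior distribution is what lets the inferences generalize.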
But it's one thing to _say_ that a Bayesian posterior quantifies which model configurations are consistent with the observed data. It's entirely another to accurately compute the extent of a posterior that spans a complex subspace of nearly equivalent model configurations.
The more degenerate a model, the harder the Bayesian computation will be, which is why you often hear people say that a bad fit indicates a modeling problem. It's also why algorithms with sensitive fitting diagnostics are so important for robust inference. #divergences4lyfe
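As an illustration (not the thread's own example), the centered "funnel" below is a classic degenerate geometry where dynamic Hamiltonian Monte Carlo typically flags its own trouble with divergent transitions; the pattern worth copying is checking the sampler's diagnostics after every fit. This is a sketch assuming PyMC and its default NUTS sampler.

```python
import pymc as pm

# Hypothetical illustration: a centered "funnel" geometry, a classic case where
# dynamic Hamiltonian Monte Carlo flags its own trouble via divergent transitions.
with pm.Model():
    log_tau = pm.Normal("log_tau", 0.0, 3.0)
    theta = pm.Normal("theta", 0.0, pm.math.exp(log_tau), shape=9)
    idata = pm.sample(1000, tune=1000, random_seed=1)

# Inspect the sampler's own diagnostics rather than trusting the fit blindly.
n_divergent = int(idata.sample_stats["diverging"].sum())
print(f"divergent transitions: {n_divergent}")
```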
How do you know whether the measurements from a given experiment will sufficiently inform a phenomenological model? You have to analyze the _experimental design_. In practice that means analyzing simulated observations and seeing how often they are sufficiently informative.
If a given experiment doesn't provide enough information _in expectation_ then you need to collect more data, or make complementary observations, or supplement the analysis with more domain expertise. Even then you might get unlucky with a particularly uninformative observation.
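Here's a hedged sketch of what such a design analysis can look like for a hypothetical conjugate Gamma-Poisson model: simulate whole experiments from the prior predictive distribution and count how often the resulting posterior is "sufficiently informative", using an arbitrary posterior-standard-deviation threshold as the criterion.

```python
import numpy as np

# Hypothetical design analysis: rate lambda ~ Gamma(alpha, beta),
# counts y_i ~ Poisson(lambda) for i = 1..N. Simulate experiments from the
# prior predictive distribution and check how often the conjugate posterior,
# Gamma(alpha + sum(y), beta + N), is narrow enough to be useful.
rng = np.random.default_rng(0)
alpha, beta = 2.0, 1.0          # prior shape and rate
N, n_sims, target_sd = 5, 1000, 0.5

informative = 0
for _ in range(n_sims):
    lam = rng.gamma(alpha, 1.0 / beta)   # ground truth drawn from the prior
    y = rng.poisson(lam, size=N)         # one simulated experiment
    post_sd = np.sqrt(alpha + y.sum()) / (beta + N)
    if post_sd < target_sd:
        informative += 1

print(f"fraction of simulated experiments informative enough: "
      f"{informative / n_sims:.2f}")
```

Note that even when the expected behavior looks fine, individual simulated experiments can still come out uninformative, which is exactly the "you might get unlucky" caveat above.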
<me trying to talk anyone out of doing a nonparametric analysis which has lovely asymptotic properties at the expense of shitty behavior for any reasonably sized observation>
In summary: from a statistical perspective overfitting manifests as degenerate likelihoods and hence, in a Bayesian analysis, degenerate/poorly-identified posteriors.
If your model doesn't force overly complex configurations, i.e. if it includes simpler models, then an accurate quantification of the predictions generated from those degenerate posteriors will be particularly robust to overfitting.
Accurate quantification of degenerate posteriors, however, is hard, even for sophisticated tools like dynamic Hamiltonian Monte Carlo. Your best friends in these circumstances are algorithms that can identify when they can't fit. Like Hamiltonian Monte Carlo!
Those diagnostics, combined with careful experimental design via simulation studies, can inform when you are at risk of overfitting and hence when you might need more informative observational models or more informative prior models.
In that sense overfitting is the consequence of careless analysis, reflecting more on the priorities of the analyzer than any particular method chosen.
I forgot one of the most important caveats! Your experimental design analysis is only useful if your experimental design is well implemented. If the practical implementation of your experiment is very different from the design, then those expectations will be nearly worthless.
This is why "blinding" and "preregistration" are so insufficient. You'll never know how poorly your experimental design was implemented, and how sophisticated of an experimental model you'll need, until analyzing the actual data.
Awkward high fives to the optimists who think that they can precisely predict how an experiment will play out, and all of the subtle systematic effects that have to be modeled, without being there or looking at the actual data.
Then again maybe I'm the only one who's ever had to deal with the influence of cross talk on detector readouts, paint fumes on mosquito behaviors, selection bias on reported results, multiple response behaviors, etc. 🤷‍♂️