\mathfrak{Michael}
Once and future physicist masquerading as a statistician. Reluctant geometer. @mcmc_stan developer. Support my writing at https://t.co/Ut25MyjIAy. He/him.

Aug 12, 2021, 21 tweets

This is an important question that hits on some of the crucial differences between the idealizations of Bayesian inference that are usually taught in introductory classes and how Bayesian inference is actually implemented in practice. A short thread!

One of the nice theoretical properties of Bayesian updating (i.e. the application of Bayes' Theorem in Bayesian inference) is that it's compatible with any _product structure_ of the observational model.

If the observational model contains two independent measurements then any realized likelihood function will be given by the product of two component likelihood functions. Bayes' Theorem allows us to update the prior distribution into a posterior distribution using both...

...or one at a time, creating an intermediate posterior distribution from the first likelihood function that becomes the prior distribution for the construction of the final posterior distribution from the second likelihood function. Either way we get the same posterior distribution!

This associativity of independent measurements also carries over to probability density functions -- we can construct the final posterior density function all at once or in two steps, creating an intermediate posterior density function in the process.
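As a minimal sketch of this associativity, consider a conjugate beta-binomial model (the prior, data, and numbers here are all illustrative): updating with both measurements pooled together and updating one at a time land on exactly the same posterior density parameters.

```python
# Conjugate beta-binomial model: a Beta(a, b) prior on a success
# probability theta, with independent binomial measurements.

def update(prior, k, n):
    """Conjugate update: Beta(a, b) prior + (k successes in n trials)."""
    a, b = prior
    return (a + k, b + n - k)

prior = (2.0, 2.0)   # Beta(2, 2) prior
y1 = (7, 10)         # first measurement: 7 successes in 10 trials
y2 = (3, 12)         # second measurement: 3 successes in 12 trials

# All at once: pool both measurements into a single update.
batch = update(prior, y1[0] + y2[0], y1[1] + y2[1])

# One at a time: the intermediate posterior becomes the next prior.
intermediate = update(prior, *y1)
sequential = update(intermediate, *y2)

assert batch == sequential == (12.0, 14.0)
```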

This even applies to Stan programs! If you have a program that implements log [ pi(y_1 | theta) pi(theta) ] then you just add log pi(y_2 | theta) to the model block to implement log [ pi(y_1, y_2 | theta) pi(theta) ] = log pi(y_1 | theta) + log pi(y_2 | theta) + log pi(theta)!
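The same bookkeeping can be sketched outside of Stan; here a hypothetical normal model in Python plays the role of the model block, with addition on the log scale standing in for Stan's `target +=` increments (all names and numbers are illustrative):

```python
import math

def log_prior(theta):
    # standard normal prior log density
    return -0.5 * theta**2 - 0.5 * math.log(2 * math.pi)

def log_lik(y, theta):
    # unit-variance normal measurement log density
    return -0.5 * (y - theta)**2 - 0.5 * math.log(2 * math.pi)

y1, y2, theta = 0.3, -1.2, 0.5

# Program one: log pi(y1 | theta) + log pi(theta)
log_joint_1 = log_lik(y1, theta) + log_prior(theta)

# "Appending to the model block": add log pi(y2 | theta)
log_joint_12 = log_joint_1 + log_lik(y2, theta)

# Same as building log pi(y1, y2, theta) directly
direct = log_lik(y1, theta) + log_lik(y2, theta) + log_prior(theta)
assert abs(log_joint_12 - direct) < 1e-12
```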

So what's the problem? Well, despite what many introductory classes will have you believe, posterior density functions actually aren't all that useful in practice.

In order to extract meaningful inferences from a posterior distribution (normalized or not!) we need to evaluate expectation values of functions, or equivalently integrate those functions against the posterior density function.

I know, you all hate thinking about expectation values, but they cannot be avoided if you want to produce a valid analysis! All well-posed summaries are derived from expectation values, such as the moments and quantiles and histograms that characterize marginal distributions!
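As a minimal illustration (synthetic standard-normal draws stand in for posterior samples): a mean is the expectation of the identity, a second moment the expectation of a square, and a probability the expectation of an indicator function, and all are estimated the same way.

```python
import random

# Synthetic "posterior samples" from a standard normal target.
random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
n = len(samples)

mean = sum(samples) / n                               # E[theta]
second = sum(s * s for s in samples) / n              # E[theta^2]
prob_pos = sum(1 for s in samples if s > 0) / n       # E[I(theta > 0)]

# For a standard normal these are near 0, 1, and 0.5 respectively.
```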

The problem is that integrating functions is haaaaaaaard, especially when the model configuration space is high dimensional. In most cases we can't do anything analytical and we have to appeal to approximations, such as the sampling-based approximations of Markov chain Monte Carlo.

Markov chain Monte Carlo, and any software that implements a Markov chain Monte Carlo method, is designed to approximate expectation values _and nothing else_. When these approximations are good enough we get a faithful picture of the information contained in our posterior dist.
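A toy sketch of this point, using a bare-bones random-walk Metropolis sampler on a one-dimensional unnormalized target (all tuning choices here are illustrative, and this is not what Stan's sampler actually does): the chain exists only to feed expectation value estimators.

```python
import math
import random

random.seed(42)

def log_target(theta):
    # unnormalized log density of a standard normal "posterior"
    return -0.5 * theta * theta

theta = 0.0
samples = []
for _ in range(50_000):
    prop = theta + random.gauss(0.0, 1.0)       # random-walk proposal
    if math.log(random.random()) < log_target(prop) - log_target(theta):
        theta = prop                             # accept
    samples.append(theta)

# Markov chain Monte Carlo estimators of expectation values -- the only
# outputs the method is designed to provide.
est_mean = sum(samples) / len(samples)           # approximates E[theta] = 0
est_var = sum(s * s for s in samples) / len(samples)  # approximates E[theta^2] = 1
```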

Sampling-based methods, however, are not compatible with the product structure of the observational model. There's no natural way to transform samples from pi(theta | y_1) into samples from pi(theta | y_1, y_2)!

Indeed designing a way to transform samples from one distribution to another is just as hard as -- in most cases actually much harder than -- generating samples from the initial distribution in the first place!

So what are we left to do? All of the methods that are commonly recommended -- estimating moments and then picking a matching density from a fixed family, constructing a --shudder-- kernel density estimator, etc. -- introduce serious problems that are hard to diagnose.

In more than a few dimensions the best these heuristics can do is capture the coarsest features of the posterior distribution. Most of the details will be lost in translation, and we won't get the right final posterior distribution anyways.
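A toy illustration of what gets lost (a log-normal distribution stands in for a skewed posterior; the threshold and sample size are arbitrary): moment matching reproduces the mean and variance essentially exactly while misstating a tail probability by orders of magnitude.

```python
import math
import random

# Skewed "posterior" samples: log-normal(0, 1) as a stand-in.
random.seed(0)
samples = [math.exp(random.gauss(0.0, 1.0)) for _ in range(200_000)]
n = len(samples)

# Moment matching: pick the normal with the same mean and variance.
m = sum(samples) / n
v = sum((s - m) ** 2 for s in samples) / n

# Tail probability P(theta > 10) directly from the samples...
tail_true = sum(1 for s in samples if s > 10) / n

# ...versus the moment-matched normal's tail probability.
z = (10 - m) / math.sqrt(v)
tail_matched = 0.5 * math.erfc(z / math.sqrt(2))

# tail_true is around 1%, tail_matched smaller by orders of magnitude.
```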

But here's the thing -- why would you try to approximate a density function with samples _when you already have the density function_? The Stan program used to generate samples from pi(theta | y_1) already implements the density function pi(theta | y_1) up to normalization!

In particular, if you have a Stan program that implements pi(y_1, theta) and code for pi(y_2 | theta), then appending the model blocks will give code that implements pi(y_1, y_2, theta)! Then you can characterize the final posterior distribution by running Stan.

That's it. That's all you have to do. No density estimators, no moment matching. Just put all of the data together and fit it all at once. This even works when the measurements aren't completely independent -- see for example betanalpha.github.io/assets/case_st….

The _only time_ you need to consider updating a computational intermediary like samples is when you can't access previous data, such as when data are streaming and not saved. This requires the heavier machinery of Sequential Monte Carlo, which is much more difficult to implement.

Anyways, this is why I'm always going on and on about the foundations of probability theory. Without that scaffolding you can't separate out _what_ you're trying to approximate and _how_ you're approximating it, making it very easy to confuse the two and fall to bad heuristics.

There are a million courses, tutorials, and blog posts out there that promise to teach you all you need to know about Bayesian inference in just a few minutes. In very narrow circumstances those lessons might be relevant, but how do you know when you're in those circumstances?
