This is an important question that hits on some of the crucial differences between the idealizations of Bayesian inference that are usually taught in introductory classes and how Bayesian inference is actually implemented in practice. A short thread!
One of the nice theoretical properties of Bayesian updating (i.e. the application of Bayes' Theorem in Bayesian inference) is that it's compatible with any _product structure_ of the observational model.
If the observational model contains two independent measurements then any realized likelihood function will be given by the product of two component likelihood functions. Bayes' Theorem allows us to update the prior distribution into a posterior distribution using both...
...or one at a time, creating an intermediate posterior distribution from the first likelihood function that becomes the prior distribution for the construction of the final posterior distribution from the second likelihood function. Either way we get the same posterior distribution!
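This is easy to verify in a toy conjugate model (a hypothetical beta-binomial setup, not from the thread itself): updating a Beta prior with two batches of binomial data one batch at a time lands on exactly the same posterior as updating with everything at once.

```python
# Toy check of sequential vs. all-at-once Bayesian updating in a
# conjugate beta-binomial model (illustrative numbers only).

def update_beta(alpha, beta, successes, trials):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data."""
    return alpha + successes, beta + trials - successes

prior = (2.0, 2.0)  # Beta(2, 2) prior on the success probability

# First measurement: 7 successes in 10 trials.
intermediate = update_beta(*prior, 7, 10)

# Second measurement: 12 successes in 20 trials, applied to the
# intermediate posterior acting as the new prior.
sequential = update_beta(*intermediate, 12, 20)

# Both measurements at once: 19 successes in 30 trials.
all_at_once = update_beta(*prior, 19, 30)
```

The two routes give the identical Beta(21, 13) posterior, which is the compatibility with product structure described above.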
This associativity of independent measurements also carries over to probability density functions -- we can construct the final posterior density function all at once or in two steps, creating an intermediate posterior density function in the process.
This even applies to Stan programs! If you have a program that implements log [ pi(y_1 | theta) pi(theta) ] = log pi(y_1 | theta) + log pi(theta) then you just add log pi(y_2 | theta) to the model block to implement log [ pi(y_1, y_2 | theta) pi(theta) ] = log pi(y_1 | theta) + log pi(y_2 | theta) + log pi(theta)!
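In code the additivity looks like this -- a plain-Python sketch with a made-up normal model standing in for a Stan model block:

```python
import math

def normal_lpdf(x, mu, sigma):
    """Log density of a normal distribution (same role as Stan's normal_lpdf)."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

y1 = [1.2, 0.8, 1.5]  # first measurement (made-up data)
y2 = [0.9, 1.1]       # second measurement (made-up data)
theta = 1.0           # a point in the model configuration space

# "Model block" for the first measurement: log pi(y1 | theta) + log pi(theta).
lp_first = (sum(normal_lpdf(y, theta, 1.0) for y in y1)
            + normal_lpdf(theta, 0.0, 1.0))

# Appending the second measurement just adds its log likelihood terms.
lp_joint = lp_first + sum(normal_lpdf(y, theta, 1.0) for y in y2)

# Identical to building the joint target in one go.
lp_direct = (sum(normal_lpdf(y, theta, 1.0) for y in y1 + y2)
             + normal_lpdf(theta, 0.0, 1.0))
```

The two constructions agree at every theta, which is all a Stan program needs to define the same posterior.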
So what's the problem? Well, despite what many introductory classes will have you believe, posterior density functions actually aren't all that useful in practice.
In order to extract meaningful inferences from a posterior distribution (normalized or not!) we need to evaluate expectation values of functions, or equivalently integrate those functions against the posterior density function.
I know, you all hate thinking about expectation values but they cannot be avoided if you want to produce a valid analysis! All well-posed summaries are derived from expectation values, such as the moments, quantiles, and histograms that characterize marginal distributions!
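Concretely, every one of those summaries is an expectation value E[f(theta)] for some function f, estimated from samples by averaging f over the samples. A quick sketch using samples from a known standard normal (not a real posterior, but it lets us check against the exact answers):

```python
import random
import math

random.seed(8675309)

# Stand-in "posterior samples" drawn from a standard normal.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
n = len(samples)

# Mean: expectation value of f(theta) = theta.
mean_est = sum(samples) / n

# Tail probability P(theta > 1): expectation value of an indicator function.
tail_est = sum(1.0 for s in samples if s > 1.0) / n

# Exact answers for the standard normal, for comparison.
tail_exact = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
```

Moments, quantiles, and histogram bin heights all reduce to averages of some f over the samples in exactly this way.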
The problem is that integrating functions is haaaaaaaard, especially when the model configuration space is high dimensional. In most cases we can't do anything analytical and we have to appeal to approximations like the sampling-based approximations of Markov chain Monte Carlo.
Markov chain Monte Carlo, and any software that implements a Markov chain Monte Carlo method, is designed to approximate expectation values _and nothing else_. When these approximations are good enough we get a faithful picture of the information contained in our posterior dist.
Sampling-based methods, however, are not compatible with the product structure of the observational model. There's no natural way to transform samples from pi(theta | y_1) into samples from pi(theta | y_1, y_2)!
Indeed designing a way to transform samples from one distribution to another is just as hard as -- in most cases actually much harder than -- generating samples from the initial distribution in the first place!
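One candidate transformation -- reweighting the existing samples by the new likelihood, i.e. importance sampling -- shows just how badly this goes. A made-up Gaussian example (my illustration, not from the thread): reweighting samples from one normal toward a nearby normal is tolerable in one dimension, but the effective sample size collapses as the dimension grows.

```python
import random
import math

random.seed(42)

def ess_fraction(dim, n=20_000):
    """Effective sample size fraction when reweighting samples from a
    standard normal toward N(1, 0.5^2), independently in each dimension."""
    log_w = []
    for _ in range(n):
        lw = 0.0
        for _ in range(dim):
            x = random.gauss(0.0, 1.0)  # sample from the "old" distribution
            # log target minus log proposal (normalizing constants cancel
            # except for the log sigma term).
            lw += (-0.5 * ((x - 1.0) / 0.5) ** 2 - math.log(0.5)
                   + 0.5 * x ** 2)
        log_w.append(lw)
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]  # stabilize before exponentiating
    return sum(w) ** 2 / (n * sum(wi * wi for wi in w))

low = ess_fraction(1)    # healthy in one dimension
high = ess_fraction(20)  # degenerate in twenty dimensions
```

In this toy setup the one-dimensional reweighting keeps a third or so of the effective samples while the twenty-dimensional reweighting keeps essentially none, even though the per-dimension mismatch is identical.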
So what are we left to do? All of the commonly recommended methods -- estimate moments and then pick a matching density from a fixed family, construct a --shudder-- kernel density estimator, etc -- introduce serious problems that are hard to diagnose.
In more than a few dimensions the best these heuristics can do is capture the broadest features of the posterior distribution. Most of the details will be lost in translation, and we won't get the right final posterior distribution anyways.
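Even in one dimension the loss is visible. As a toy illustration (a made-up example of mine; the problem is far worse in high dimensions): moment-match a normal density to samples from a skewed exponential "posterior" and the tail probabilities come out badly wrong, even though the first two moments match by construction.

```python
import random
import math

random.seed(1234)

# Pretend these are posterior samples; here they come from a unit exponential.
samples = [random.expovariate(1.0) for _ in range(100_000)]
n = len(samples)

# Moment matching: pick the normal with the same mean and standard deviation.
mu = sum(samples) / n
sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / (n - 1))

def normal_tail(x, mu, sigma):
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

# Tail probability P(theta > 3) under each description of the "posterior".
empirical_tail = sum(1.0 for s in samples if s > 3.0) / n  # near exp(-3)
matched_tail = normal_tail(3.0, mu, sigma)                 # badly undersized
```

The moment-matched normal reports less than half of the actual tail probability, exactly the kind of detail that matters for downstream decisions.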
But here's the thing -- why would you try to approximate a density function with samples _when you already have the density function_? The Stan program used to generate samples from pi(theta | y_1) already implements the density function pi(theta | y_1) up to normalization!
In particular if you have a Stan program that implements pi(y_1, theta) and code for pi(y_2 | theta), then appending the model blocks will give code that implements pi(y_1, y_2, theta)! Then you can characterize the final posterior distribution by running Stan.
That's it. That's all you have to do. No density estimators, no moment matching. Just put all of the data together and fit it all once. This even works when the measurements aren't completely independent -- see for example betanalpha.github.io/assets/case_st….
The _only time_ you need to consider updating a computational intermediary like samples is when you can't access previous data, such as when data are streaming and not saved. This requires the heavier machinery of Sequential Monte Carlo which is much more difficult to implement.
Anyways, this is why I'm always going on and on about the foundations of probability theory. Without that scaffolding you can't separate out _what_ you're trying to approximate and _how_ you're approximating it, making it very easy to confuse the two and fall to bad heuristics.
There are a million courses, tutorials, and blog posts out there that promise to teach you all you need to know about Bayesian inference in just a few minutes. In very narrow circumstances those lessons might be relevant, but how do you know when you're in those circumstances?

Thread by Michael "Shapes Dude" Betancourt (@betanalpha)

