Intersection of physics and probabilistic computation story time! These spiraling coin funnel toys demonstrate both conservation of angular momentum and why funnel-shaped densities are hard to fit with Hamiltonian Monte Carlo.
As the coin spirals down, gravitational potential energy is converted to kinetic energy -- the coin falls and accelerates. Because angular momentum is conserved, the shape of the spiral is constrained; as the coin speeds up, the radius of the spiral has to decrease proportionally.
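To make the two conservation laws concrete (my notation, not part of the original thread): for a coin of mass m circling at radius r with speed v,

L = m v r = constant, so v is proportional to 1 / r,
(1/2) m v^2 - (1/2) m v_0^2 = m g (h_0 - h).

In other words halving the radius means doubling the speed and quadrupling the kinetic energy, and that extra kinetic energy has to be paid for by a drop in height.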
The exact trajectory is ultimately determined by the shape of the funnel and by how the normal force exerted on the coin interacts with the conserved energy and angular momentum.
In particular, because of the conservation of angular momentum, spirals are confined to a relatively narrow band of heights. The coin can't fall much deeper without shedding angular momentum or picking up extra energy! On a perfectly frictionless surface the coin would just keep orbiting within that band forever.
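One way to see that band of heights (again my notation): for fixed angular momentum L the radial motion of the coin feels an effective potential

V_eff(r) = L^2 / (2 m r^2) + m g z(r),

where z(r) is the height of the funnel surface at radius r. The centrifugal term blows up as r shrinks, so for a fixed total energy the coin is trapped between two turning points, which is exactly that narrow band of heights.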
In reality friction and collisions with imperfections in the funnel surface dissipate energy and angular momentum, allowing the coin to drop deeper without having to store up an enormous amount of kinetic energy.
Now typical implementations of Hamiltonian Monte Carlo, technically the ones with Gaussian-Euclidean cotangent disintegrations, i.e. constant "mass matrices"? The trajectories they generate are mathematically equivalent to those of a frictionless particle in a certain physical system.
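Concretely, with a target density pi(q) and a constant mass matrix M the trajectories are generated by the Hamiltonian (standard form, written here in my own notation)

H(q, p) = -log pi(q) + (1/2) p^T M^{-1} p,

where -log pi(q) plays the role of the potential energy, i.e. the "height" in the funnel, and (1/2) p^T M^{-1} p the kinetic energy. Exact trajectories conserve H, just like the frictionless coin conserves its total energy.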
The physical system corresponding to "funnel" target density functions, like those that arise in latent Gaussian models, is pretty much equivalent to the spiraling coin system. In particular our Hamiltonian Monte Carlo trajectories have the same height restriction!
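For reference the canonical example (Neal's funnel, written schematically in my notation rather than quoted from the thread) is

v ~ normal(0, 3),
x_n ~ normal(0, exp(v / 2)), n = 1, ..., N,

with the second argument a standard deviation. As v decreases the x_n are squeezed into an ever narrower neck, and as v increases they spread into a wide mouth, giving -log pi(v, x) its funnel shape.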
No matter how long we integrate, the trajectories can go only so far up and down the funnel. Only by resampling the momenta between trajectories can we add or remove energy and move further up or down the funnel. But this process is slow, leading to diffusive exploration.
In higher dimensions the diffusion gets even slower -- resampling the momenta is much more likely to add energy than remove it, making it more likely to move up higher in the funnel than down deeper into it. That's why it takes forever to explore the neck of the funnel.
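One way to see the dimensionality effect (a standard property of the Gaussian momenta, not spelled out in the thread itself): with a constant mass matrix the kinetic energy right after a resample is

K = (1/2) p^T M^{-1} p ~ (1/2) chi^2_D,

whose mean grows like D / 2 while its spread grows only like sqrt(D / 2). Each resample therefore hands the trajectory a large, nearly fixed chunk of kinetic energy, which is one ingredient in the slow, dimension-dependent energy exploration described above.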
Keep in mind that this is a property of the exact trajectories and so we get slow exploration _even with perfect integrators_. When we use numerical integrators we have to deal with more problems like divergences, but they're layered on top of these fundamental issues.
Gaussian-Riemannian cotangent disintegrations, i.e. varying "mass matrices", require a log determinant normalization term. Conveniently this acts like an energy reservoir in the physical system, soaking up energy to allow the particle to quickly drop to the bottom of the funnel.
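Schematically (my notation again) the Hamiltonian picks up that extra term,

H(q, p) = -log pi(q) + (1/2) log det M(q) + (1/2) p^T M(q)^{-1} p,

and because (1/2) log det M(q) now varies with position it can trade energy with the potential and kinetic terms, which is exactly the reservoir behavior.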
That's why so-called "Riemannian Hamiltonian Monte Carlo" is so much better equipped to fit hierarchical models. It's not so much the numerical integration that's better, it's the actual geometry of the Hamiltonian trajectories!
Unfortunately "Riemannian" Hamiltonian Monte Carlo is a giant pain to implement efficiently, and even harder to implement automatically (which is why it's not exposed in Stan). Fortunately we can exactly emulate that better geometry by non-centering the funnel!
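As a minimal sketch of what non-centering does to the geometry (hypothetical Python, not from the thread; the model is Neal's funnel with standard deviation exp(v / 2)):

    import numpy as np

    def funnel_logp_centered(v, x):
        # Centered form: v ~ normal(0, 3), x_n | v ~ normal(0, sd = exp(v / 2)).
        # Additive constants dropped; this joint density has the funnel geometry.
        return -0.5 * v**2 / 9.0 - 0.5 * np.exp(-v) * np.sum(x**2) - 0.5 * x.size * v

    def funnel_logp_noncentered(v, eta):
        # Non-centered form: substitute x = exp(v / 2) * eta with eta ~ normal(0, 1).
        # The density over (v, eta) is an isotropic Gaussian that Euclidean
        # Hamiltonian Monte Carlo explores easily; x is recovered deterministically.
        return -0.5 * v**2 / 9.0 - 0.5 * np.sum(eta**2)

    # Quick check that both densities evaluate on the same underlying point.
    x = np.array([0.3, -1.2, 0.8])
    print(funnel_logp_centered(0.5, x), funnel_logp_noncentered(0.5, x * np.exp(-0.25)))

Same model, but in the non-centered coordinates the "funnel surface" -log pi is just a round bowl, so the trajectories are no longer pinned to a narrow band of heights.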
This equivalence between reparameterizations of the target space and better geometries for Hamiltonian Monte Carlo is the subject of my last geometry paper, arxiv.org/abs/1910.09407. It's a bit technical but there are lots of pictures!
I also wrote much more about the trials and tribulations of Hamiltonian trajectories in the funnel in arxiv.org/abs/1312.0906. This includes lots of pictures of both good and bad trajectories.
But why limit ourselves to pictures when we can have _movies_? This is a typical "Euclidean" trajectory. Notice how it bounces within a narrow band of heights.
On the other hand the "energy reservoir" in "Riemannian" Hamiltonian Monte Carlo allows trajectories to span huge differences in heights by absorbing and releasing energy as needed.
ANYWAYS. Hierarchical models in centered parameterizations are hard to fit not just because of divergences but also due to fundamental constraints on the trajectories.
In this case we can build up some intuition for why that happens using a relatively familiar physical analogy, but most of the time pathologies in Hamiltonian Monte Carlo fits are much more sophisticated, so we shouldn't lean on physical analogies _too_ much! -fin-

