It's the weekend so I'm going to rant a little bit about degenerate inferences, why they're so common in machine learning, and why they can't just be ignored. A cathartic thread.
A concept that's popular in machine learning and academic statistics is _universality_, or degrees thereof. As in universal function approximators or models that capture every possible data generating process over an observation space.
Mathematically the idea is to have models that are incredibly flexible -- lots and lots of parameters, little regularization -- so that with infinite data you can recover any signal/truth. It helps that this asymptotic limit facilitates a lot of mathematical analysis.
But the flexibility is a double-edged sword. A model that's flexible enough to fit almost any signal asymptotically will be _too_ flexible for almost any _finite_ observation. An outright profane number of model configurations will be consistent with any finite data set.
The more flexible the model, and hence the more universal it seems asymptotically, the worse this degeneracy will be preasymptotically. For example, for any fixed data size the degeneracies will tend to cover more expansive regions of the model configuration space.
Alternatively the more flexible the model the more data will be needed to achieve any fixed concentration of consistent model configurations into a finite neighborhood.
Importantly these degeneracies don't manifest as spherical fuzziness concentrating around any particular point but rather as complex, twisted surfaces that extend far across the model configuration space.
The stronger the degeneracy the less any single model configuration represents all of the model configurations compatible with any given, finite observation. Optimization will often find _a_ reasonable solution, but only by ignoring many others that are almost just as good.
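This is easy to see in even a two-parameter toy model (a hypothetical example of my own, not from any real application): in y = a·b·x only the product a·b is identified by the data, so gradient descent from different initializations happily returns very different "solutions" that all lie on the same hyperbola.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately degenerate model: y = (a * b) * x.  Only the product a*b is
# identified, so an entire hyperbola of (a, b) pairs fits the data equally well.
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=50)

def fit(a, b, lr=0.01, steps=5000):
    # Plain gradient descent on squared error from a given initialization.
    for _ in range(steps):
        resid = a * b * x - y
        grad_a = np.mean(2 * resid * b * x)
        grad_b = np.mean(2 * resid * a * x)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b

fits = [fit(a0, b0) for a0, b0 in [(0.5, 3.0), (3.0, 0.5), (1.0, 1.5)]]
# Every run recovers the same product a*b, but wildly different (a, b):
# the individual parameters are not pinned down by any finite data set.
for a, b in fits:
    print(f"a = {a:.3f}, b = {b:.3f}, a*b = {a*b:.3f}")
```

Each run converges cleanly, reports a tiny loss, and looks like "the" answer -- while silently picking one point out of a continuum of equally good configurations.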
Unfortunately quantifying _all_ of the model configurations consistent with the observed data, and any available domain expertise, is really really important for decision making, as the optimal decisions are often very sensitive to the model configurations.
This is fine when trying to caption images -- all you need is one good caption, not all of them -- but it becomes critical when informing policy -- a medical treatment that works for a few patients but not most of them is a disaster. Same for social and economic policy.
The frustrating thing is that none of this is particularly hard to see -- hell it's encoded in much of the computational folklore of machine learning and statistics. Sensitivity to initial conditions? Multiple modes? The need for heuristic regularization?
Honestly I don't take anyone talking about Bayesian neural networks these days seriously. Try to run MCMC and pay attention to the recommended diagnostics and you'll start to see just how nasty, and hard to quantify, neural network likelihood functions are in practice.
Stop complaining that MCMC is slow when your model is ridiculously degenerate. MCMC is like a gas -- it will fill the available space, or at least try to. If there's too much space then this will take forever regardless of how efficient your Markov transitions are.
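As a toy illustration of the gas analogy (my own hypothetical example, using a deliberately non-identified likelihood and a deliberately naive random-walk Metropolis sampler): the chain nails the identified direction almost immediately but has to diffuse slowly along the entire length of the degenerate ridge.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy non-identified model: y ~ N(a * b * x, 0.1).  The posterior for (a, b)
# concentrates on the long, curved ridge a * b ≈ 2 rather than around a point.
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=50)

def log_post(a, b):
    log_lik = -0.5 * np.sum((y - a * b * x) ** 2) / 0.1**2
    log_prior = -0.5 * (a**2 + b**2) / 5.0**2  # weak N(0, 5) priors
    return log_lik + log_prior

# Naive random-walk Metropolis.  Isotropic proposals mostly fall off the thin
# ridge and get rejected; the chain can only diffuse slowly along its length.
a, b = 1.0, 2.0
accepted = 0
samples = []
for _ in range(20000):
    a_prop = a + 0.05 * rng.normal()
    b_prop = b + 0.05 * rng.normal()
    if np.log(rng.uniform()) < log_post(a_prop, b_prop) - log_post(a, b):
        a, b = a_prop, b_prop
        accepted += 1
    samples.append((a, b))

samples = np.array(samples)
products = samples[:, 0] * samples[:, 1]
# The identified direction (a * b) is pinned down tightly; the degenerate
# direction (a itself) wanders over a far wider range.
print("acceptance:", accepted / len(samples))
print("sd of a*b :", products.std())
print("sd of a   :", samples[:, 0].std())
```

The slowness isn't the sampler's fault -- there is simply an enormous amount of posterior volume to explore, and any correct method has to explore it.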
Oh, but variational Bayes is so much faster? Yeah, because variational Bayes implicitly projects your degenerate model onto a much simpler model space where all of those universality guarantees disappear. Not that anyone will ever actually acknowledge that.
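The projection is easy to see in the one case where it can be worked out in closed form: for a multivariate Gaussian target, the optimal mean-field (factorized) approximation under the usual reverse-KL objective has factor precisions equal to the diagonal of the target precision matrix, so correlated directions get their variance crushed. A minimal sketch (the correlation value here is purely illustrative):

```python
import numpy as np

# Mean-field variational approximation to a correlated 2-d Gaussian.
# For a Gaussian target the optimal factorized q is available in closed form:
# each factor's precision equals the corresponding diagonal entry of the
# target's precision matrix.
rho = 0.95
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)

true_marginal_var = Sigma[0, 0]        # exact marginal variance: 1.0
vb_marginal_var = 1.0 / Lambda[0, 0]   # mean-field variance: 1 - rho**2

print("true marginal variance:", true_marginal_var)
print("mean-field variance   :", vb_marginal_var)
```

At rho = 0.95 the mean-field approximation reports roughly a tenth of the true marginal variance -- a "fast" fit that has quietly thrown away most of the uncertainty along the correlated (read: degenerate) directions.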
Again all of that can be fine if you're dealing with applications with trivial consequences. But once your analysis affects real people it's no longer ethical to pretend that data analysis is a fucking engineering task and completely ignore the assumptions you are making.
I once attended a workshop where a famous computer scientist boasted about how his topic model -- a notoriously degenerate family of models -- identified a classification of medical pathologies that clinicians found reasonable, and that it was going to revolutionize medicine.
And all of those other model configurations that his variational Bayes fit ignored that the clinicians would also have found reasonable? The ones that would suggest very different interventions and treatments? No, those weren't important to consider.
Ultimately we get this perfect storm of models designed for asymptotic flexibility and model evaluations that focus on point estimates, which in practice manifests in extremely degenerate models and evaluation procedures that ignore that degeneracy entirely.
And now people in machine learning are patting themselves on the back for starting to talk about these problems while continuing to ignore the applied statisticians and practitioners screaming about these consequences forever? <fights growing aneurysm>
I know that due to dwindling resources and increasing competition, academia and some related industries have become incredibly cutthroat, and people have to fight for funding and recognition. I know that it's harder and harder to play that game and fulfill long-held ambitions.
But when you start to externalize the costs of sloppy methodologies, pushing hype and ignoring important consequences to fight for your share of the increasingly-meager pie? I know my personal opinion means little but I have zero respect for people who make that compromise.
Anyways, that's my weekend rant. I know that I shouldn't expect much better but I'm an idiot and even after all of this time I'm still frustrated, disappointed, and angry.
Shout out to the exceptional people on the front lines fighting against these trends, those who refuse to ignore the consequences and toil under-funded and under-appreciated to do what is needed. I hope that my work helps in your struggle, even if only marginally.
Thread by Michael "El Muy Muy" Betancourt (@betanalpha)