Michael "El Muy Muy" Betancourt (@betanalpha), 26 tweets, 4 min read

https://twitter.com/danilobzdok/status/1265268606155309062

"Multilevel" model is perhaps the most loaded term in all of statistics; in most use cases it carries with it a surprisingly large number of independent assumptions. A short thread.

Important caveat: the language used in the applied and theoretical stats literature is inconsistent and terms are often motivated by historical contexts that are no longer relevant. The language I will use in this thread is entirely my own.

"Multilevel" models are used in the context of regression where we want to understand how changes in some known covariates influence the statistical behavior of some unknown variates.

Typically we specify the observational model for the variates with parametric families of probability density functions and then introduce covariate dependence by replacing one or more of the parameters with functions of the covariates,

pi(y | x; phi) = pi(y; f(x, phi)).
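
As a minimal sketch of this construction, assume a normal observational model whose location parameter is replaced by a function of the covariate. The linear form of `f`, the parameterization `phi = (alpha, beta)`, and the function names are illustrative assumptions, not something specified in the thread.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of a normal distribution with location mu and scale sigma."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def f(x, phi):
    """Hypothetical location function; phi = (alpha, beta) is an assumption."""
    alpha, beta = phi
    return alpha + beta * x

def regression_lpdf(y, x, phi, sigma):
    """log pi(y | x; phi) = log pi(y; f(x, phi)).

    Only the location depends on the covariate; the scale sigma does not.
    """
    return normal_lpdf(y, f(x, phi), sigma)
```

Making the scale depend on the covariates as well would mean replacing `sigma` with a second function of `x`, which is a separate modeling assumption.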

I like this construction because it allows us to reason about what behaviors of the variate distribution are correlated with the covariates. The covariates are correlated only with the location of the variates? Then only the location parameter is a function of the covariates.

All of this regression context then motivates looking for functions of the form f(x, phi) that might be relevant to a particular data generating process. In particular what functional forms will be useful?

Linear regression approximates f with its Taylor expansions in some neighborhood of the covariates. This is by far the best way to think about linear regression and I will not be taking any questions on that matter...
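
To make the Taylor expansion view concrete, here is a small sketch that builds the first-order expansion of a function around a point and compares it to the exact function nearby and far away. The choice of `exp` as the "true" f and the expansion point 0 are assumptions made purely for illustration.

```python
import math

def taylor_first_order(f, df, x0):
    """Return the first-order Taylor approximation of f around x0."""
    intercept = f(x0)   # plays the role of the regression intercept
    slope = df(x0)      # plays the role of the regression slope
    return lambda x: intercept + slope * (x - x0)

# Assumed example: f(x) = exp(x), whose derivative is also exp(x).
linear_f = taylor_first_order(math.exp, math.exp, 0.0)

near_error = abs(math.exp(0.1) - linear_f(0.1))  # inside the neighborhood
far_error = abs(math.exp(2.0) - linear_f(2.0))   # well outside it
```

The linear form is only trustworthy in the neighborhood of the expansion point, which is why the approximation error grows as the covariate moves away from it.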

"Multilevel" models take a different approach. Instead of thinking about an explicit functional form they discretize f(x, phi) into separate parameters corresponding to intervals of covariate values,

f_{n} \approx f(x, phi) for x_{n} < x < x_{n + 1}.

This discretized approximation is quite flexible, able to capture a variety of functional behaviors at the expense of losing resolution of the covariate values. It's particularly useful when the x are already discretized.
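
A rough sketch of the discretization: the covariate is mapped to the index of the interval that contains it, and the function value is looked up from one parameter per interval. The edge values, the parameter values, and the half-open convention `edges[n] <= x < edges[n + 1]` (chosen so that boundary points are assigned to exactly one interval) are assumptions for this example.

```python
from bisect import bisect_right

def discretize(x, edges):
    """Return n such that edges[n] <= x < edges[n + 1]."""
    if not edges[0] <= x < edges[-1]:
        raise ValueError("x falls outside the discretization")
    return bisect_right(edges, x) - 1

def f_discrete(x, edges, f_n):
    """Piecewise-constant approximation: f(x, phi) is replaced by f_n."""
    return f_n[discretize(x, edges)]

edges = [0.0, 1.0, 2.0, 3.0]   # hypothetical interval boundaries x_n
f_n = [-0.3, 0.1, 0.8]         # one free parameter per interval

value = f_discrete(1.5, edges, f_n)  # looks up the parameter for [1.0, 2.0)
```

All covariate values within an interval share a single parameter, which is the loss of resolution mentioned above.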

What is the set of discrete f_n parameters called? Is it a factor? Or a level? Honestly I have given up on trying to find consistent terminology that everyone will agree upon.

Anyways, we're nowhere near done.

So far we've discretized the influence of just one covariate, but what if we have multiple covariates? If we discretize more than one covariate then a function like f(x^1, x^2, phi) is characterized by an infinite series of parameters.

The first order behavior assumes that the covariates influence f independently,

f(x^1, x^2, phi) = f^1(x^1, phi) + f^2(x^2, phi).

Then we can write

f(x^1_n, x^2_m, phi) \approx f^1_n + f^2_m.
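
The additive approximation can be sketched with two small parameter lists, one per discretized covariate. The interval counts and parameter values below are made up for illustration.

```python
f1 = [-0.5, 0.0, 0.5]   # f^1_n: one parameter per interval of covariate 1
f2 = [1.0, 2.0]         # f^2_m: one parameter per interval of covariate 2

def f_first_order(n, m):
    """First-order approximation f(x^1_n, x^2_m, phi) ~ f^1_n + f^2_m."""
    return f1[n] + f2[m]

n_cells = len(f1) * len(f2)    # distinct (n, m) combinations
n_params = len(f1) + len(f2)   # parameters needed at first order
```

Here 6 cells are described by 5 parameters; the gap widens quickly as more covariates or finer discretizations are added, which is the practical payoff of assuming independent contributions.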

The second order behavior assumes that the covariates influence f in pairs,

f(x^1, x^2, x^3, phi) = f^{12}(x^1, x^2, phi) + f^{23}(x^2, x^3, phi) + f^{13}(x^1, x^3, phi).

Then we can write

f(x^1_n, x^2_m, x^3_l, phi) \approx f^{12}_{n^{12}} + f^{23}_{n^{23}} + f^{13}_{n^{13}}.
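
One way to sketch the pairwise terms is to keep a table per pair of covariates, with one parameter for each combination of their intervals. The interval counts and the few non-zero values below are assumptions chosen only to show the bookkeeping.

```python
import itertools

sizes = {1: 2, 2: 3, 3: 2}  # assumed number of intervals for each covariate

# One table per covariate pair; each table holds one parameter per pairwise
# intersection of intervals, initialized to a placeholder value of zero.
f_pair = {
    (i, j): {cell: 0.0
             for cell in itertools.product(range(sizes[i]), range(sizes[j]))}
    for i, j in itertools.combinations(sorted(sizes), 2)
}

# A few hypothetical non-zero pairwise contributions.
f_pair[(1, 2)][(0, 1)] = 0.5
f_pair[(1, 3)][(0, 1)] = 1.0
f_pair[(2, 3)][(1, 1)] = -0.2

def f_second_order(n, m, l):
    """Sum the pairwise contributions for intervals (n, m, l)."""
    idx = {1: n, 2: m, 3: l}
    return sum(table[(idx[i], idx[j])] for (i, j), table in f_pair.items())

n_pair_params = sum(len(table) for table in f_pair.values())
```

Each tuple key such as `(0, 1)` plays the role of the index for one pairwise intersection of the discretizations.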

Here n^ij indexes the pairwise intersections of the covariate discretizations. It's a bit easier to see with pictures but this thread is already too long, and really you should hire me to give a course at your company to see accompanying figures. ;-)

For the mathematically inclined we're treating each discretized covariate group as a vector and expanding the output function as a tower of tensor products,

1 \otimes 2 \otimes 3 = (1 \oplus 2 \oplus 3) \otimes ( (1 \otimes 2) \oplus (2 \otimes 3) \oplus (1 \otimes 3) )...

Beyond the math what are we actually doing? We're assuming that some complex functional relationship between a parameter in our model and observed covariates can be decomposed into independent contributions from each discrete covariate value.

If we're feeling particularly saucy then we might add corrections to account for two-way interactions, three-way interactions, etc.

To summarize, the heart of a "multilevel" model is assuming that the covariates influence the rest of the model independently (at least to first order) and then discretizing the covariate influence into a finite number of functions.

I used to call the parameters corresponding to each first-order covariate influence a "level", with "multilevel" corresponding to adding those first order influences together to approximate the total influence, but I'm not sure if that's going to confuse everyone.

Anyways, note that "hierarchical model" has not yet been involved. That's because mathematically there's nothing in this construction that has _required_ hierarchical priors. In practice, however, they are _almost always assumed_.

In particular each group of parameters corresponding to a covariate (sometimes people call these discretizations "levels"...) is given its own hierarchical model to add some dynamical regularization and help the fit when not all of the covariate intervals are well populated.
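
As a hedged sketch of what such a hierarchical prior could look like, the interval parameters for one covariate can be drawn from a shared population distribution. The normal population model, the half-normal prior on its scale, and the function name are all assumptions; the thread does not specify a particular form.

```python
import random

def sample_hierarchical_prior(n_intervals, rng):
    """Draw interval parameters f_n from a shared normal population model."""
    mu = rng.gauss(0.0, 1.0)         # population location (assumed prior)
    tau = abs(rng.gauss(0.0, 1.0))   # population scale (assumed half-normal)
    f_n = [rng.gauss(mu, tau) for _ in range(n_intervals)]
    return mu, tau, f_n

rng = random.Random(1)  # fixed seed so the draw is reproducible
mu, tau, f_n = sample_hierarchical_prior(5, rng)
```

Because every `f_n` is tied to the same `mu` and `tau`, intervals with little data are pulled toward the population behavior informed by the better populated intervals.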

Now go to software like `lm` and its derivatives and "multilevel" implies that the original statistical model for the variate takes the form of a general linear model. In other words "multilevel" presumes hierarchical priors and a general linear model for the variate.

This kind of shorthand is incredibly dangerous, especially after a few generations where the original motivation is lost. By taking all of those assumptions for granted we forget to ask if they're needed or can be replaced with other assumptions better suited to an application.

I much prefer to say "multilevel hierarchical general linear model" to make clear the entire model that I am assuming and to facilitate discussion about whether all of those assumptions are appropriate.

Even better let's stop trying to specify models with loaded terminology entirely and just specify the full model with probabilistic programs that are rich enough to communicate all of the assumptions directly. Is that too much to ask? -fin-
