Why You Really Should Be Thinking About Linear Regression As A Taylor Approximation
Recall that linear regression can be derived by approximating the model

y ~ normal(f(x), sigma)

with a first-order Taylor approximation of f,

f(x) ≈ f(x_0) + df/dx(x_0) * (x - x_0)
     = alpha + beta * (x - x_0).
This is a decent approximation to f only when the gradients are much larger than the higher-order derivatives, which is true only _far away from optima_. Close to an optimum the gradients vanish and the behavior of f is dominated by the second-order derivatives.
In other words, if you've collected covariates x near an optimum of f then you need a _quadratic_ model in the covariates, not a linear one (yes, this is still technically a linear regression, but it defines a much better context for how the covariates are being used).
Models linear in the covariates are best for modeling behavior far from equilibrium, while models quadratic in the covariates are best for modeling behavior close to equilibrium. Domain expertise, especially about the experimental design, is often available to choose between the two.
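
To make the failure mode concrete, here's a minimal numerical sketch in Python; the test function and the sampling window are hypothetical, chosen only to put the data near an optimum:

import numpy as np

# A hypothetical test function with an optimum at x = 2; near it the
# gradient vanishes and the curvature dominates.
f = lambda x: -(x - 2)**2 + 3

rng = np.random.default_rng(1)
x = rng.uniform(1.8, 2.2, size=100)       # covariates collected near the optimum
y = f(x) + rng.normal(0, 0.01, size=100)

# First-order fit: the slope is nearly zero and explains almost nothing.
print(np.polyfit(x, y, deg=1))

# Second-order fit: recovers the curvature (leading coefficient near -1).
print(np.polyfit(x, y, deg=2))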
Okay, but _which_ quadratic terms should we include? Typical recommendations are to include _interactions_,

beta_{nm} * (x_{n} - x_{n,0}) * (x_{m} - x_{m,0}),

that capture the cross derivatives in the Taylor expansion, but we can't forget about the regular quadratic terms!
Whether the interactions x_{n} * x_{m} or the diagonal terms x_{n} * x_{n} dominate is determined by the structure of the optima! Box, Hunter, and Hunter have a fantastic discussion of this. Always assuming that the interactions dominate is not good practice!
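
As a sketch of what the full second-order expansion looks like in practice, here's one way to build the corresponding design matrix in Python; the function name quadratic_features and the arguments x and x0 are hypothetical:

import numpy as np

# A sketch of a full second-order covariate expansion around a chosen
# baseline x0, with intercept, linear, diagonal, and interaction columns.
def quadratic_features(x, x0):
    dx = x - x0                    # perturbations, not raw covariate values
    N, M = dx.shape
    cols = [np.ones(N)]                                      # intercept
    cols += [dx[:, n] for n in range(M)]                     # linear terms
    cols += [dx[:, n] * dx[:, n] for n in range(M)]          # diagonal terms
    cols += [dx[:, n] * dx[:, m]                             # interactions
             for n in range(M) for m in range(n + 1, M)]
    return np.stack(cols, axis=1)

X = quadratic_features(np.random.default_rng(0).normal(size=(5, 3)), np.zeros(3))
print(X.shape)   # (5, 1 + 3 + 3 + 3) = (5, 10)

Dropping either the diagonal columns or the interaction columns is an assumption about the local geometry of f, which is exactly what domain expertise should inform.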
Now here's my favorite part -- considering higher-order covariate contributions helps to make it very clear why empirical "standardization" of the covariates is bad bad bad... (The only asymptotics I like are the asymptotics of how quickly the terribleness of bad ideas grows).
When we make a Taylor approximation we have to define _around which covariate values_ we are making the Taylor approximation. The actual covariates should be _perturbations_ around this point, not the raw values themselves.
For models linear in the covariates this choice changes the interpretation of the model. If you use perturbations

f(x_0) + df/dx(x_0) * (x - x_0) = alpha + beta * (x - x_0)

then the intercept alpha = f(x_0) is the baseline value of f at the approximation point and the slope beta = df/dx(x_0) is the local gradient.
But if you use the nominal covariate values then

alpha = f(x_0) - df/dx(x_0) * x_0
beta = df/dx(x_0).

In this case the intercept becomes much less interpretable and in particular harder to build principled prior models for!
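
Here's a minimal numerical sketch of the difference, assuming a hypothetical linear truth and a baseline at x_0 = 10:

import numpy as np

# A hypothetical linear truth and a principled baseline x0 = 10.
f = lambda x: 1.5 + 0.7 * x
x0 = 10.0

rng = np.random.default_rng(2)
x = rng.uniform(9.0, 11.0, size=200)
y = f(x) + rng.normal(0, 0.1, size=200)

# Perturbation parameterization: alpha is f(x0), beta is the local gradient.
beta_c, alpha_c = np.polyfit(x - x0, y, deg=1)
print(alpha_c)   # ~ f(10) = 8.5, directly interpretable

# Nominal parameterization: alpha absorbs f(x0) - df/dx(x0) * x0.
beta_n, alpha_n = np.polyfit(x, y, deg=1)
print(alpha_n)   # ~ 1.5, the extrapolated value of f at x = 0, far from the data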
But whatever, you say, you don't care about non-terrible prior models for the intercept. Fine, I say, but what happens when you include higher-order covariates? Trying to isolate the nominal covariate contributions requires hiding even more terms in the intercept and slope.
f(x_0) + df/dx(x_0) * (x - x_0) + 1/2 * d^2f/dx^2(x_0) * (x - x_0)^2
=
[f(x_0) - df/dx(x_0) * x_0 + 1/2 * d^2f/dx^2(x_0) * x_0^2]
+ [df/dx(x_0) - d^2f/dx^2(x_0) * x_0] * x
+ 1/2 * d^2f/dx^2(x_0) * x^2

Yuck, good luck interpreting that.
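
If you want to verify the algebra, here's a quick numeric check of those bracketed coefficients; the baseline and derivative values are arbitrary assumptions:

import numpy as np

# Arbitrary assumed values for f(x_0), df/dx(x_0), and d^2f/dx^2(x_0).
x0 = 3.0
f0, df, d2f = 2.0, -1.0, 0.5

x = np.linspace(2.5, 3.5, 50)
y = f0 + df * (x - x0) + 0.5 * d2f * (x - x0)**2

# Fit in the nominal covariate x; np.polyfit returns [c2, c1, c0].
c2, c1, c0 = np.polyfit(x, y, deg=2)

print(np.isclose(c0, f0 - df * x0 + 0.5 * d2f * x0**2))   # messy intercept
print(np.isclose(c1, df - d2f * x0))                      # messy slope
print(np.isclose(c2, 0.5 * d2f))                          # clean curvature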
The mess gets even worse when you empirically standardize the first-order and second-order covariates independently (which happens automatically in so many software packages). Now you get _inconsistent_ subtractions for the first-order and higher-order terms!
That's on top of the usual problem that an empirical standardization is derived from one arbitrary data set that won't be relevant for future data sets, even though you _have_ to keep using that same standardization for all future data sets to ensure self-consistent predictions.
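
The self-consistent alternative is to fix the center (and any scale) once, from a principled choice of x_0, and reuse it verbatim on every data set. A minimal sketch, with hypothetical values for x_0 and the scale:

import numpy as np

# Hypothetical principled choices, fixed once from domain expertise and
# never recomputed from any particular data set.
x0, scale = 10.0, 2.0

def standardize(x):
    return (x - x0) / scale

def features(x):
    z = standardize(x)
    # The same subtraction and scale feed every order, so the first-order
    # and second-order terms stay consistent with each other.
    return np.stack([np.ones_like(z), z, z**2], axis=1)

print(features(np.array([8.0, 9.5, 11.0])))   # training data
print(features(np.array([12.0, 13.0])))       # future data, same map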
Thinking about the geometry of Taylor approximations when using (general) linear models will help you build principled prior models, determine when you need to include higher-order covariates, and implement self-consistent standardizations based on a principled choice of x_0.
Thanks for reading. This long rant was brought to you by a weird dream I had last night that somehow involved higher-order covariate terms in a linear regression.