It seems like every week, if not more frequently, I learn something new about a basic estimation method -- OLS, 2SLS, and offshoots. My students seem skeptical when I tell them this but it's true.
This week: centering before creating squares and interactions.
Now, I've taught this in the context of OLS and 2SLS for a long time, and it comes up a lot in my introductory book. It's often needed to give main effects a sensible interpretation -- whether those are exogenous or endogenous, whether it's a pooled method or FE.
But one case where I've been too cavalier is with creating instruments out of squares and interactions of exogenous variables when, say, the structural equation includes w*xj where w is endogenous and xj is exogenous. We can use xj*zh as IVs, where zh is an exogenous variable excluded from the structural equation.
I used to tell my students there's no need to center before creating xj*zh because they only show up in first stage linear projections and the 2SLS estimates are invariant. That claim is true, but there's still value in centering squares and interactions of IVs. Why?
Because we should be looking at the first stage regressions, and centering in creating the IVs makes the pattern of which exogenous variables are most relevant for which endogenous variables much easier to discern.
If I project w onto x1, z1, and x1*z1, I could easily get a small and insignificant coefficient on z1, making me think I have a weak IV when in fact it's quite strong. If w*(x1 - x1bar) is also in the model, then it will be more informative to project it onto (x1 - x1bar)*(z1 - z1bar).
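A minimal simulated sketch of the first-stage point above. The data-generating process, the nonzero means of x1 and z1, and the helper `ols` are all assumptions made for illustration; nothing here comes from an actual application. Both projections span the same space, so the fitted values are identical, but only the centered version reports the relevance of z1 on its own coefficient.

```python
import numpy as np

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical DGP: x1 and z1 have means far from zero.
x1 = rng.normal(5.0, 1.0, n)
z1 = rng.normal(3.0, 1.0, n)
# z1 matters only through x1*z1, so its partial effect at x1 = 5 is about 1.0,
# while the population coefficient on z1 in the uncentered projection is 0.
w = 0.3 * x1 + 0.2 * x1 * z1 + rng.normal(size=n)

ones = np.ones(n)
X_unc = np.column_stack([ones, x1, z1, x1 * z1])
X_cen = np.column_stack([ones, x1, z1, (x1 - x1.mean()) * (z1 - z1.mean())])

b_unc = ols(w, X_unc)
b_cen = ols(w, X_cen)
fit_unc = X_unc @ b_unc
fit_cen = X_cen @ b_cen

print("coef on z1, uncentered interaction:", b_unc[2])
print("coef on z1, centered interaction:  ", b_cen[2])
print("same fitted values:", np.allclose(fit_unc, fit_cen))
```

In the centered projection the coefficient on z1 is the partial effect of z1 at the sample mean of x1. In the uncentered one it is the partial effect at x1 = 0, which lies far outside the simulated data.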
Yesterday I was feeling a bit guilty about not teaching lasso, etc. to the first-year PhD students. I'm feeling less guilty today. How much trouble does one want to go through to control for squares and interactions for a handful of control variables?
And then it gets worse if I want my key variable to interact with controls. You can't select the variables in the interactions using lasso. I just looked at an application in an influential paper and a handful of controls, some continuous, were discretized.
Discretizing eliminates the centering problem I mentioned, but in a crude way. So I throw out information by arbitrarily using five age and income categories so I can use pdslasso? No thanks.
I was reminded of the issue of centering IVs when creating an example using PDS LASSO to estimate a causal effect. The issue of centering controls before including them in a dictionary of nonlinear terms seems like it can be important.
The example I did included age and income as controls. Initially I included age, age^2, inc^2, age*inc. PDSLASSO works using a kind of Frisch-Waugh partialling out, imposing sparsity on the controls.
But as we know from basic OLS, not centering before creating squares and interactions can make main effects weird -- with the "wrong" sign and insignificant. This means in LASSO they might be dropped.
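A small sketch of what centering does to a dictionary built from age and income. The simulated data, the units (age in decades, income in $10,000s), the outcome equation, and the helper `dictionary` are hypothetical. The sketch only runs OLS; it does not run LASSO, so it illustrates why an uncentered main effect can look weak or wrongly signed, not what any particular selector would do.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000
age = rng.uniform(2.5, 6.5, n)    # hypothetical: age in decades
inc = rng.uniform(2.0, 10.0, n)   # hypothetical: income in $10,000s

def dictionary(a, i, center):
    """Columns: 1, age, age^2, inc, inc^2, age*inc.
    With center=True, squares and the interaction use demeaned variables."""
    a0 = a - a.mean() if center else a
    i0 = i - i.mean() if center else i
    return np.column_stack([np.ones_like(a), a, a0**2, i, i0**2, a0 * i0])

# Hypothetical outcome: the effect of age at average age and income is +0.5.
ac, ic = age - 4.5, inc - 6.0
y = 0.5*ac + 0.3*ac**2 + 0.4*ic - 0.05*ic**2 + 0.1*ac*ic + rng.normal(size=n)

b_unc = np.linalg.lstsq(dictionary(age, inc, False), y, rcond=None)[0]
b_cen = np.linalg.lstsq(dictionary(age, inc, True), y, rcond=None)[0]

# Correlation between the main effect and its square, before and after centering.
corr_unc = np.corrcoef(age, age**2)[0, 1]
corr_cen = np.corrcoef(age, (age - age.mean())**2)[0, 1]

print("age coefficient, uncentered terms:", b_unc[1])
print("age coefficient, centered terms:  ", b_cen[1])
print("corr(age, age^2):", corr_unc, " corr(age, centered age^2):", corr_cen)
```

The uncentered age coefficient is the slope at age = 0 and income = 0, which is why its sign can flip relative to the slope at the sample means.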
I've decided to share a Dropbox folder containing a recent paper -- a sort of "pre-working" paper -- on panel data estimators for DID/event studies. I'm "in between" web pages (and could use recommendations on a simple, effective platform).
The paper starts with algebraic equivalence results -- hence the somewhat odd title -- and applies those to interventions with common entry time and staggered entry. I think it's useful to see the equivalence between TWFE with lots of heterogeneity and pooled OLS equivalents.
I think of it as a parametric regression adjustment version of Callaway and Sant'Anna (but using levels rather than differences). And, as in Sun and Abraham, I make a connection with TWFE (while allowing for covariates).
Speaking of two-way FE, it's been under fire for the last few years for estimating treatment effects in DID designs -- especially staggered designs. As many on here know. As an older person, I don't let go of my security blankets so easily.
Certainly the simple TWFE estimator that estimates a single coefficient can be misleading. We know this thanks to recent work of several talented econometricians (you know who you are). But maybe we're just not being flexible enough with treatment heterogeneity.
Now when I teach panel data interventions, I start with basic TWFE but note that, with multiple treatment periods and different entry times, we can easily include interactions that allow for many different average treatment effects (on the treated).
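A rough sketch of that kind of flexible specification on a simulated balanced panel with staggered entry. The design (three cohorts, four periods), the true effects in `att`, and the noise levels are invented for illustration, and this is not meant to reproduce the paper's exact specification. It fits cohort-by-period treatment dummies two ways: TWFE via two-way demeaning, and pooled OLS with cohort and period dummies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, T = 900, 4

# cohort = first treated period; 0 means never treated (hypothetical design)
cohort = np.repeat([0, 3, 4], n_units // 3)
unit = np.repeat(np.arange(n_units), T)      # rows are ordered unit by unit
t = np.tile(np.arange(1, T + 1), n_units)
g = cohort[unit]

# Hypothetical true ATTs by (cohort, period)
att = {(3, 3): 1.0, (3, 4): 2.0, (4, 4): 3.0}
D = np.column_stack([(g == gg) & (t == tt) for (gg, tt) in att]).astype(float)

c = rng.normal(size=n_units) + 0.5 * cohort  # unit effects correlated with cohort
f = np.array([0.0, 0.3, 0.6, 0.9])           # period effects
y = (c[unit] + f[t - 1] + D @ np.array(list(att.values()))
     + rng.normal(scale=0.5, size=n_units * T))

def two_way_demean(v):
    """Remove unit and period means, add back the grand mean (balanced panel)."""
    m = v.reshape(n_units, T, -1)
    m = (m - m.mean(axis=1, keepdims=True) - m.mean(axis=0, keepdims=True)
         + m.mean(axis=(0, 1), keepdims=True))
    return m.reshape(n_units * T, -1)

# TWFE with one treatment dummy per treated (cohort, period) cell
att_twfe = np.linalg.lstsq(two_way_demean(D), two_way_demean(y).ravel(),
                           rcond=None)[0]

# Pooled OLS with cohort dummies and period dummies in place of unit effects
X = np.column_stack([np.ones(n_units * T), g == 3, g == 4,
                     t == 2, t == 3, t == 4, D])
att_pols = np.linalg.lstsq(X, y, rcond=None)[0][-3:]

for key, a, b in zip(att, att_twfe, att_pols):
    print(key, "TWFE:", a, "pooled OLS:", b)
```

In this balanced setup the two sets of estimates coincide up to floating-point error, and each lands near its cohort-specific effect rather than being averaged into a single coefficient.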
More on LPM versus logit and probit. In my teaching, I revisited a couple of examples: one using data from the Boston Fed mortgage approval study; the other using a balanced subset of the "nonexperimental" data from Lalonde's classic paper on job training.
In both cases, the key explanatory variable is binary: an indicator for being "white" in the Fed study (outcome: mortgage approved?), a job training participation indicator in the Lalonde study (outcome: employed after program?).
With just the binary indicator included, probit, logit, and the linear model tell similar stories, but the estimates of the average treatment effects do differ. In the Lalonde case they differ by about 4 percentage points (19 vs 22 vs 23, roughly).
So, I decide to practice what I (and many others) preach ....
A somewhat common device in panel data models is to lag explanatory variables when they're suspected of being "endogenous." It often seems to be done without much thought, as if lagging solves the problem and we can move on. I have some thoughts about it.
First, using lags changes the model -- and it doesn't always make sense. For example, I wouldn't lag inputs in a production function. I wouldn't lag price in a demand or supply function. In other cases, it may make sense to use a lag rather than the contemporaneous variable.
Under reasonable assumptions, the lag, x(i,t-1), is sequentially exogenous (predetermined). You are modeling a certain conditional expectation. But, logically, it cannot be strictly exogenous. Therefore, fixed effects estimation is inconsistent with fixed T, N getting large.
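A simulation sketch of the fixed-T inconsistency. The data-generating process is an assumption chosen for illustration: x(i,t) responds to the same-period shock u(i,t), so x(i,t-1) is uncorrelated with u(i,t) but correlated with u(i,t-1). The function name `fe_with_lagged_x` and all parameter values are hypothetical.

```python
import numpy as np

def fe_with_lagged_x(n, T, beta=1.0, rho=1.0, seed=0):
    """Within (FE) estimate of beta in y_it = beta * x_{i,t-1} + c_i + u_it,
    where x_it = c_i + rho * u_it + v_it (feedback from the shock to x)."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=(n, 1))          # unit effect, correlated with x
    u = rng.normal(size=(n, T + 1))      # shocks for periods 0..T
    v = rng.normal(size=(n, T + 1))
    x = c + rho * u + v                  # x_it for periods 0..T
    x_lag = x[:, :-1]                    # x_{i,t-1} for t = 1..T
    y = beta * x_lag + c + u[:, 1:]      # y_it for t = 1..T
    xd = x_lag - x_lag.mean(axis=1, keepdims=True)   # within transformation
    yd = y - y.mean(axis=1, keepdims=True)
    return (xd * yd).sum() / (xd ** 2).sum()

b_short = fe_with_lagged_x(n=20_000, T=3)          # many units, few periods
b_long = fe_with_lagged_x(n=5_000, T=30, seed=1)   # many periods

print("true beta = 1.0")
print("FE estimate, T = 3: ", b_short)
print("FE estimate, T = 30:", b_long)
```

With a large N, the T = 3 estimate stays well away from the true value, which is the sense in which more units do not help. In this particular design the gap shrinks as T grows.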