I've decided to share a Dropbox folder containing a recent paper -- a sort of "pre-working" paper -- on panel data estimators for DID/event studies. I'm "in between" web pages (and could use recommendations on a simple, effective platform).
The paper starts with algebraic equivalence results -- hence the somewhat odd title -- and applies those to interventions with common entry time and staggered entry. I think it's useful to see the equivalence between TWFE with lots of heterogeneity and pooled OLS equivalents.
I think of it as a parametric regression adjustment version of Callaway and Sant'Anna (but using levels rather than differences). And, as in Sun and Abraham, I make a connection with TWFE (while allowing for covariates).
Speaking of two-way FE, it's been under fire for the last few years for estimating treatment effects in DID designs -- especially staggered designs. As many on here know. As an older person, I don't let go of my security blankets so easily.
Certainly the simple TWFE estimator that estimates a single coefficient can be misleading. We know this thanks to recent work of several talented econometricians (you know who you are). But maybe we're just not being flexible enough with treatment heterogeneity.
Now when I teach panel data interventions, I start with basic TWFE but note that, with multiple treatment periods and different entry times, we can easily include interactions that allow for many different average treatment effects (on the treated).
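A minimal simulation sketch of that flexible specification, in Python with NumPy only. Everything in it is an assumption made for illustration and is not taken from the paper: the three cohorts (never treated, coded 99; first treated in period 3; first treated in period 4), the effect sizes in `true_att`, and the function name `staggered_sketch`. It uses a pooled OLS form with cohort dummies, time dummies, and one dummy per treated cohort-by-period cell, and also returns the single-coefficient TWFE estimate for comparison.

```python
import numpy as np

def staggered_sketch(n_per_cohort=400, n_periods=6, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical design: 99 codes "never treated"; other values are entry periods.
    entry = np.repeat([99, 3, 4], n_per_cohort)
    n = entry.size
    t = np.arange(n_periods)
    post = t[None, :] >= entry[:, None]          # treatment indicator, shape (n, T)

    # Assumed true effects: they grow with exposure and are larger for cohort 4.
    scale = np.where(entry == 3, 1.0, 2.0)
    tau = np.where(post, scale[:, None] * (t[None, :] - entry[:, None] + 1), 0.0)
    true_att = {(3, 3): 1.0, (3, 4): 2.0, (3, 5): 3.0, (4, 4): 2.0, (4, 5): 4.0}

    # Unit effects are correlated with cohort; common time trend; parallel trends hold.
    c = rng.normal(size=n) + (entry < 99)
    y = c[:, None] + 0.5 * t[None, :] + tau + rng.normal(size=(n, n_periods))

    # Pooled OLS: intercept, cohort dummies, time dummies, treated cohort-by-period cells.
    G = np.repeat(entry, n_periods)              # cohort of each unit-period row
    P = np.tile(t, n)                            # period of each unit-period row
    cells = [(g, s) for g in (3, 4) for s in range(g, n_periods)]
    cols = [np.ones(n * n_periods)]
    cols += [(G == g).astype(float) for g in (3, 4)]
    cols += [(P == s).astype(float) for s in range(1, n_periods)]
    cols += [((G == g) & (P == s)).astype(float) for g, s in cells]
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y.ravel(), rcond=None)
    att_hat = dict(zip(cells, coef[-len(cells):]))

    # Single-coefficient TWFE via two-way demeaning of the treatment dummy (balanced panel).
    d = post.astype(float)
    dd = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    b_twfe = float((dd * y).sum() / (dd ** 2).sum())
    return true_att, att_hat, b_twfe
```

Calling `staggered_sketch()` returns the assumed effects, the five estimated cell-level effects, and the one TWFE number. In this design the interaction coefficients track the assumed effects cell by cell, while a single coefficient can only summarize them with one weighted number.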
More on LPM versus logit and probit. In my teaching, I revisited a couple of examples: one using data from the Boston Fed mortgage approval study; the other using a balanced subset of the "nonexperimental" data from Lalonde's classic paper on job training.
In both cases, the key explanatory variable is binary: an indicator for being "white" in the Fed study (outcome: mortgage approved?), a job training participation indicator in the Lalonde study (outcome: employed after program?).
When just adding the binary indicator alone, probit, logit, and linear tell similar stories, but the estimates of the average treatment effects do differ. In the Lalonde case, by 4 percentage points (19 vs 22 vs 23, roughly).
So, I decided to practice what I (and many others) preach ....
A somewhat common device in panel data models is to lag explanatory variables when they're suspected as being "endogenous." It often seems to be done without much thought, as if lagging solves the problem and we can move on. I have some thoughts about it.
First, using lags changes the model -- and it doesn't always make sense. For example, I wouldn't lag inputs in a production function. I wouldn't lag price in a demand or supply function. In other cases, it may make sense to use a lag rather than the contemporaneous variable.
Under reasonable assumptions, the lag, x(i,t-1), is sequentially exogenous (predetermined). You are modeling a certain conditional expectation. But, logically, it cannot be strictly exogenous. Therefore, fixed effects estimation is inconsistent with fixed T, N getting large.
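A small simulation sketch of that last point, in Python with NumPy. The data-generating process is an assumption made for illustration: x(i,t) reacts to the current shock u(i,t) through a hypothetical `feedback` parameter, so x(i,t-1) is uncorrelated with u(i,t) but correlated with u(i,t-1). The function name `fe_with_lagged_regressor` is also made up.

```python
import numpy as np

def fe_with_lagged_regressor(n_units=20000, n_periods=4, beta=1.0,
                             feedback=0.8, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_units, n_periods + 1))   # shocks for t = 0, ..., T
    e = rng.normal(size=(n_units, n_periods + 1))
    c = rng.normal(size=(n_units, 1))               # unit heterogeneity

    # Assumed feedback: x(i,t) responds to the contemporaneous shock u(i,t).
    x = feedback * u + e
    x_lag = x[:, :-1]                               # x(i,t-1) for t = 1, ..., T

    # The model uses the lag, which is predetermined: uncorrelated with u(i,t).
    y = beta * x_lag + c + u[:, 1:]

    # Fixed effects (within) estimator: demean by unit, then pooled OLS.
    xd = x_lag - x_lag.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    return float((xd * yd).sum() / (xd ** 2).sum())
```

With T = 4 and 20,000 units the FE estimate stays visibly below the assumed beta of 1. Demeaning puts u(i,t-1) into every period's error, and that is exactly the shock x(i,t-1) responds to. Raising `n_periods` shrinks the gap; raising `n_units` does not.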
In 2018 I was invited to give a talk at SOCHER in Chile, to give my opinions about using spatial methods for policy analysis. I like the idea of putting in spatial lags of policy variables to measure spillovers. Use fixed effects with panel data, compute fully robust ses.
For the life of me, I couldn't figure out how putting in spatial lags of Y had any value. After preparing a course in July 2020, I was even more negative about this practice. It seems an unnecessary complication developed by theorists.
As far as I can tell, when spatial lags in Y are used, one always computes the effects of own policy changes and neighbor policy changes, anyway, by solving out. This is done much more robustly and much more easily modeling spillovers directly without spatial lags in Y.
I taught a bit of GMM for cross-sectional data the other day. In the example I used, there was no efficiency gain in using GMM with a heteroskedasticity-robust weighting matrix over 2SLS. I was reminded of the presentation on GMM I gave 20 years ago at ASSA.
The session was organized by the AEA, and papers were published in the 2001 JEL issue "Symposium on Econometric Tools." Many top econometricians gave talks, and I remember hundreds attended. (It was a beautiful audience. The largest ever at ASSA. But ASSA underreported the size.)
In my talk I commented on how, for standard problems -- single equation models estimated with cross-sectional data, and even time series data -- I often found GMM didn't do much, and using 2SLS with appropriately robust standard errors was just as good.
I think frequentists and Bayesians are not yet on the same page, and it has little to do with philosophy. It seems some Bayesians think a proper response to clustering standard errors is to specify an HLM. But in the linear case, HLM leads to GLS, not OLS.
Moreover, a Bayesian would take the HLM structure seriously in all respects: variance and correlation structure and distribution. I'm happy to use an HLM to improve efficiency over pooled estimation, but I would cluster my standard errors, anyway. A Bayesian would not.
There still seems to be a general confusion that fully specifying everything and using a GLS or joint MLE is a costless alternative to pooled methods that use few assumptions. And the Bayesian approach is particularly unfair to pooled methods.
In such cases, we have a clear tradeoff between consistency and efficiency.
In models additive in endogenous explanatory variables with constant coefficients, CF reduces to 2SLS or FE2SLS -- which is neat. Of course, the proof uses Frisch-Waugh.
The equivalence between CF and 2SLS implies a simple, robust specification test of the null that the EEVs are actually exogenous. One can use "robust" or Newey-West or "cluster robust" very easily. The usual Hausman test is not robust, and suffers from degeneracies.
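A sketch of both points in Python with NumPy, under an assumed linear design with one endogenous variable `w`, one exogenous control `x1`, and one instrument `z` (all names and coefficient values are hypothetical). It checks that the control function coefficient on `w` matches 2SLS, and computes a heteroskedasticity-robust (HC0) t statistic on the first-stage residual as the exogeneity test.

```python
import numpy as np

def ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def cf_vs_2sls(n=5000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                  # instrument
    x1 = rng.normal(size=n)                 # exogenous control
    v = rng.normal(size=n)                  # first-stage error
    u = 0.5 * v + rng.normal(size=n)        # structural error, correlated with v
    w = 1.0 + 0.8 * z + 0.5 * x1 + v        # endogenous explanatory variable
    y = 2.0 + 1.5 * w + 1.0 * x1 + u

    ones = np.ones(n)
    Z = np.column_stack([ones, x1, z])      # all exogenous variables

    # First stage: fitted values and residuals.
    w_hat = Z @ ols(Z, w)
    v_hat = w - w_hat

    # 2SLS computed as OLS on the fitted values.
    b_2sls = ols(np.column_stack([ones, w_hat, x1]), y)

    # Control function: OLS of y on w, x1, and the first-stage residual.
    X_cf = np.column_stack([ones, w, x1, v_hat])
    b_cf = ols(X_cf, y)

    # HC0 robust variance for the CF regression; t statistic on v_hat.
    resid = y - X_cf @ b_cf
    XtX_inv = np.linalg.inv(X_cf.T @ X_cf)
    meat = (X_cf * resid[:, None] ** 2).T @ X_cf
    V = XtX_inv @ meat @ XtX_inv
    t_vhat = b_cf[3] / np.sqrt(V[3, 3])
    return float(b_2sls[1]), float(b_cf[1]), float(t_vhat)
```

The first two returned numbers agree up to floating-point error, and the t statistic is large here because the design builds in endogeneity. Swapping the HC0 "meat" for a cluster or Newey-West version would give the other robust variants; that step is not shown.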
If you teach prob/stats to first-year PhD students, and you want to prepare them to really understand regression, go light on measure theory, counting, combinatorics, distributions. Emphasize conditional expectations, linear projections, convergence results.
This means, of course, law of iterated expectations, law of total variance, best MSE properties of CEs and LPs. How to manipulate Op(1) and op(1). Slutsky's theorem. Convergence in distribution. Asymptotic equivalence lemma. And as much matrix algebra as I know.
If you're like me -- and barely understand basic combinatorics -- you'll also be happier. I get the birthday problem and examples of the law of very large numbers -- and that's about it.
When I teach regression with time series I emphasize that even if we use GLS (say, Prais-Winsten), we should make standard errors robust to serial correlation (and heteroskedasticity). Just like with weighted least squares.
Both are consistent under standard identification assumptions. Using a probit first stage could be more efficient. Those are the optimal IVs if (1) Var(u|x,z) is constant and (2) P(w = 1|x,z) = probit. It's consistent without either assumption, just like 2SLS.
As shown by my former student Ruonan Xu, the probit first stage can help with a weak IV problem:
A bit more on clustering. If you observe the entire population and assignment is at the unit level, there is no need to cluster. If assignment is at the group level -- to all units -- cluster at the group level. (Hopefully there are many groups.)
I've used the term "ex-post clustering" to describe obsession with clustering just to do it. You don't cluster individual data at the county, state, or regional level just for the heck of it. One must take a stand on the sampling and assignment schemes.
It's easy to see with formulas for estimating the mean from a population. The clustered standard error is too large because heterogeneity in the means across groups is mistaken for cluster correlation.
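A Monte Carlo sketch of the mean example, in Python with NumPy. The setup is assumed for illustration: a population split into 50 groups with fixed, different group means, and individuals drawn by simple random sampling, so group membership is just a label. The function name `mean_se_comparison` is hypothetical.

```python
import numpy as np

def mean_se_comparison(n_groups=50, n_sample=2000, n_reps=500, seed=0):
    rng = np.random.default_rng(seed)
    # Fixed feature of the population: group means differ.
    group_means = rng.normal(size=n_groups)

    means, usual, clustered = [], [], []
    for _ in range(n_reps):
        # Random sampling of individuals; the group is recorded but not sampled on.
        g = rng.integers(0, n_groups, size=n_sample)
        y = group_means[g] + rng.normal(size=n_sample)
        dev = y - y.mean()

        # Usual standard error of the sample mean.
        usual.append(y.std(ddof=1) / np.sqrt(n_sample))

        # Cluster-robust standard error: sum deviations within group first.
        group_sums = np.bincount(g, weights=dev, minlength=n_groups)
        clustered.append(np.sqrt((group_sums ** 2).sum()) / n_sample)
        means.append(y.mean())

    # Actual sampling variation of the mean vs. the two average standard errors.
    return float(np.std(means, ddof=1)), float(np.mean(usual)), float(np.mean(clustered))
```

In this design the usual standard error lines up with the actual spread of the sample mean across replications, while the clustered one is several times larger, because the within-group sums of deviations are dominated by the differences in group means.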
I've become a believer in always reporting "robust" standard errors. This may seem obvious, but there are nuances. And I'm not talking indiscriminate clustering -- I'll comment on that at some point. Let's start with random sampling from a cross section.
Based on questions I get, it seems there's confusion about choosing between RE and FE in panel data applications. I'm afraid I've contributed. The impression seems to be that if RE "passes" a suitable Hausman test then it should be used. This is false.
I'm trying to emphasize in my teaching that using RE (unless CRE = FE) is an act of desperation. If the FE estimates and the clustered standard errors are "good" (intentionally vague), there's no need to consider RE.
RE is considered when the FE estimates are too imprecise to do much with. With good controls -- say, industry dummies in a firm-level equation -- one might get by with RE. And then choosing between RE and FE makes some sense.
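On the "CRE = FE" aside: a sketch in Python with NumPy, reading CRE as the Mundlak-style correlated random effects device of adding unit time averages of x. For simplicity it uses pooled OLS with the time averages rather than RE GLS, on a balanced panel. The design and the function name `fe_and_mundlak` are assumptions made for illustration.

```python
import numpy as np

def fe_and_mundlak(n_units=500, n_periods=5, seed=0):
    rng = np.random.default_rng(seed)
    c = rng.normal(size=(n_units, 1))                       # unit heterogeneity
    x = 0.7 * c + rng.normal(size=(n_units, n_periods))     # x correlated with c
    y = 1.0 * x + c + rng.normal(size=(n_units, n_periods))
    xbar = x.mean(axis=1, keepdims=True)

    # Fixed effects (within) estimator.
    xd = x - xbar
    yd = y - y.mean(axis=1, keepdims=True)
    b_fe = float((xd * yd).sum() / (xd ** 2).sum())

    ones = np.ones(n_units * n_periods)
    x_long = x.ravel()                                      # unit-major stacking
    xbar_long = np.repeat(xbar.ravel(), n_periods)

    # Mundlak device: pooled OLS of y on x and the unit time average of x.
    b_cre, *_ = np.linalg.lstsq(
        np.column_stack([ones, x_long, xbar_long]), y.ravel(), rcond=None)

    # Pooled OLS without the time average ignores the correlation with c.
    b_pols, *_ = np.linalg.lstsq(
        np.column_stack([ones, x_long]), y.ravel(), rcond=None)
    return b_fe, float(b_cre[1]), float(b_pols[1])
```

The first two numbers coincide up to rounding, which is the sense in which CRE reproduces FE here. The third drifts away from the assumed coefficient of 1 because the heterogeneity is correlated with x.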
So we agree that, provided y is the variable of interest -- not censored -- a linear model estimated by OLS is a good starting point. But other functional forms can be better, such as logistic if y is binary or fractional, exponential if y is nonnegative.
In many cases one should include the covariates flexibly -- such as squares and interactions. This is especially true in treatment effect contexts. If w is the treatment, interact it with the controls when estimating the average treatment effect.
As @TymonSloczynski showed in his elegant 2020 REStat paper, if d is the treatment, just adding x as in the regression y on d, x can produce a badly biased estimate of the ATE. Interacting d and elements of x is generally better. Same is true for nonlinear regression adjustment.
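A simulation sketch of the additive versus interacted regressions, in Python with NumPy. The design is assumed for illustration and is not from the paper: one covariate x, a treatment effect of 1 + 2x (so the ATE is 1 because x has mean zero), and a logistic propensity that makes treatment fairly rare and more likely at high x. The function name `ate_additive_vs_interacted` is hypothetical.

```python
import numpy as np

def ate_additive_vs_interacted(n=20000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)

    # Assumed selection: treatment is more likely when x is large.
    p = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * x)))
    d = (rng.uniform(size=n) < p).astype(float)

    # Heterogeneous effect 1 + 2x; the true ATE is 1 because E(x) = 0.
    y = x + d * (1.0 + 2.0 * x) + rng.normal(size=n)
    ones = np.ones(n)

    # Additive specification: y on d, x.
    b_add, *_ = np.linalg.lstsq(np.column_stack([ones, d, x]), y, rcond=None)

    # Interacted specification: y on d, x, d * (x - xbar).
    # Centering x makes the coefficient on d the ATE estimate.
    xc = x - x.mean()
    b_int, *_ = np.linalg.lstsq(
        np.column_stack([ones, d, x, d * xc]), y, rcond=None)
    return float(b_add[1]), float(b_int[1])
```

In this design the interacted coefficient on d sits close to 1, while the additive coefficient is far above it. Centering x at its sample mean before interacting is what lets the coefficient on d be read directly as the ATE estimate.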