Speaking of two-way FE, it's been under fire for the last few years for estimating treatment effects in DID designs -- especially staggered designs. As many on here know. As an older person, I don't let go of my security blankets so easily.
Certainly the simple TWFE estimator that estimates a single coefficient can be misleading. We know this thanks to recent work of several talented econometricians (you know who you are). But maybe we're just not being flexible enough with treatment heterogeneity.
Now when I teach panel data interventions, I start with basic TWFE but note that, with multiple treatment periods and different entry times, we can easily include interactions that allow for many different average treatment effects (on the treated).
The ATTs can vary by exposure (cohort) and calendar date. For example, if we have 4 entry times with irreversibility, we estimate 4 + 3 + 2 + 1 = 10 different effects rather than one. These identify the ATTs for the different exposure levels and time periods.
Not surprisingly, identification requires no anticipation and common trends. I dabbled with this a bit in my 2005 REStat paper, but I didn't do a full analysis of what one can identify with different treatment patterns.
When we introduce covariates -- so that CT holds conditional on covariates as in Callaway and Sant'Anna -- we get further flexibility. With four entry periods and one covariate, there are 14 additional interactions.
When the covariates are centered about exposure-specific means, the ATTs for each exposure/time period are easily gotten. With 4 control periods and 4 treatment periods and just a single X, the TWFE includes 4 + 10 + 10 regressors (not including FE dummies).
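To make the counting concrete, here is a small sketch that enumerates the terms. The period numbering (1 to 8, entry in periods 5 to 8) is illustrative, and reading the "4" as the four treatment-period dummies interacted with x (mirroring `c.d2013#c.x c.d2014#c.x` in the Stata command in this thread) is my assumption rather than something spelled out above.

```python
# Assumed layout: 4 control periods (1-4), 4 treatment periods (5-8),
# one cohort entering in each treatment period, treatment irreversible.
T_PRE, T_POST = 4, 4
periods = range(1, T_PRE + T_POST + 1)
entry_times = range(T_PRE + 1, T_PRE + T_POST + 1)

# One ATT per (cohort, calendar period) cell with period >= entry time.
att_terms = [(g, t) for g in entry_times for t in periods if t >= g]

# With one covariate: each ATT cell interacted with the cohort-demeaned x,
# plus (assumption) each treatment-period dummy interacted with x itself.
att_x_terms = [(g, t, "x_dm") for (g, t) in att_terms]
time_x_terms = [(t, "x") for t in entry_times]

n_att = len(att_terms)
n_extra = len(att_x_terms) + len(time_x_terms)
n_total = n_att + n_extra
print(n_att, n_extra, n_total)  # prints: 10 14 24
```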
Why am I not abandoning the TWFE framework? I'm getting old and I'm lazy. But also I know FE has resiliency to unbalanced panels. It has bias on the order of 1/T when strict exogeneity is violated. Estimating unit-specific trends, as in my 2005 REStat, is a clear extension.
So I know that, with multiple pre-treatment periods, I can remove unit-specific trends to at least partly relax the common trends assumption. Another reason for studying FE: the equivalence with the Mundlak regression suggests strategies for nonlinear models.
I'm trying to finish a draft of what seems like mostly an expository paper, with the thrilling title "Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimation." Oh, and I'm preparing for an interview with @causalinf.
A sample (and simple) Stata command with T = 4, two treated periods (3 and 4), staggered, one x:
xtreg y c.e3#c.d2013 c.e3#c.d2014 c.e4#c.d2014 c.e3#c.d2013#c.x_dm3 c.e3#c.d2014#c.x_dm3 c.e4#c.d2014#c.x_dm4 d2013 d2014 c.d2013#c.x c.d2014#c.x, fe vce(cluster id)
I expect I'm about to be taught some things. One is never too old for that ....
The coefficients on the first three terms are the estimated TEs: the ATT for the cohort first exposed in 2013 during 2013, the effect for that cohort in 2014, and the effect for the cohort first exposed in 2014 during 2014.
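For readers without Stata, here is a rough NumPy sketch of the same regression on simulated data. Everything about the data-generating process (sample size, effect sizes, the seed) is made up for illustration. The regressors follow the order of the Stata command, and the unit fixed effects are handled by within-unit demeaning, which is valid here because the simulated panel is balanced. Cluster-robust standard errors are not computed.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 900
years = np.array([2011, 2012, 2013, 2014])

# Hypothetical cohorts: 0 = never treated, otherwise the first treated year.
cohort = rng.choice([0, 2013, 2014], size=N)
x = rng.normal(size=N) + 0.5 * (cohort > 0)      # time-constant covariate

e3 = (cohort == 2013).astype(float)[:, None]     # shape (N, 1)
e4 = (cohort == 2014).astype(float)[:, None]
d2013 = (years == 2013).astype(float)[None, :]   # shape (1, T)
d2014 = (years == 2014).astype(float)[None, :]
x_col = x[:, None]
x_dm3 = x_col - x[cohort == 2013].mean()         # x centered at cohort means
x_dm4 = x_col - x[cohort == 2014].mean()

# Assumed true ATTs: (cohort 2013 in 2013), (cohort 2013 in 2014), (cohort 2014 in 2014).
true_att = np.array([1.0, 2.0, 1.5])
effect = (e3 * d2013 * (true_att[0] + 0.5 * x_dm3)
          + e3 * d2014 * (true_att[1] + 0.5 * x_dm3)
          + e4 * d2014 * (true_att[2] + 0.5 * x_dm4))
y = (rng.normal(size=(N, 1)) + 0.2 * x_col       # unit effect, correlated with x
     + 0.4 * d2013 + 0.7 * d2014                 # time effects
     + 0.3 * x_col * (d2013 + d2014)             # time-varying slope on x
     + effect
     + rng.normal(scale=0.5, size=(N, years.size)))

ones = np.ones((N, 1))
regressors = [
    e3 * d2013, e3 * d2014, e4 * d2014,                          # treatment dummies
    e3 * d2013 * x_dm3, e3 * d2014 * x_dm3, e4 * d2014 * x_dm4,  # times demeaned x
    ones * d2013, ones * d2014,                                  # time dummies
    x_col * d2013, x_col * d2014,                                # time dummies times x
]

def within(a):
    """Subtract unit-specific time averages (the FE transformation)."""
    return a - a.mean(axis=1, keepdims=True)

X_mat = np.column_stack([within(r).ravel() for r in regressors])
coef = np.linalg.lstsq(X_mat, within(y).ravel(), rcond=None)[0]
att_hat = coef[:3]
print(np.round(att_hat, 2))
```

In this simulation the first three coefficients should land close to the assumed ATTs of 1.0, 2.0, and 1.5.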
There's a good reason the Frisch-Waugh-Lovell Theorem is taught in intro econometrics, at least at the graduate level. It's used to characterize omitted variable bias as well as the plim of OLS estimators under treatment heterogeneity and also diff-in-diffs. And more.
I also teach the 2SLS version of FWL, where the exogenous variables, X, are partialled out of the IVs, Z, with endogenous explanatory variables W. It's important to emphasize that the IVs need to be residualized with respect to X. Let Z" be those residuals. This is the key partialling out.
Then apply 2SLS to any of the equations
Y = W*b + U1
Y" = W*b + U2
Y" = W"*b + U3
Y = W"*b + U4
using IVs Z".
All four deliver the 2SLS estimates of b on the full equation Y = X*a + W*b + U with IVs (X,Z). All " variables have X partialled out from them.
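A quick numerical check of the four equivalences, as a sketch on simulated data. The data-generating process, the helper names (`fit`, `resid`, `tsls`), and the true coefficient of 1.5 are all illustrative assumptions. X includes a constant, so the residualized variables have mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # exogenous, with constant
Z = rng.normal(size=(n, 2)) + X[:, [1]]                     # IVs, correlated with X
u = rng.normal(size=n)                                      # structural error
W = (Z @ np.array([1.0, -0.5]) + X @ np.array([0.2, 0.3, -0.4])
     + 0.8 * u + rng.normal(size=n))[:, None]               # endogenous regressor
Y = X @ np.array([1.0, 0.5, -1.0]) + 1.5 * W[:, 0] + u

def fit(A, B):
    """Fitted values from the least-squares projection of B on the columns of A."""
    return A @ np.linalg.lstsq(A, B, rcond=None)[0]

def resid(A, B):
    """Residuals from projecting B on A (partialling A out of B)."""
    return B - fit(A, B)

def tsls(y, R, inst):
    """2SLS: project the regressors R on the instruments, then OLS of y on the fit."""
    return np.linalg.lstsq(fit(inst, R), y, rcond=None)[0]

# Full equation: Y on (X, W) with instruments (X, Z); keep the coefficient on W.
b_full = tsls(Y, np.column_stack([X, W]), np.column_stack([X, Z]))[-1]

# Partialled-out versions; the instruments are always the residualized Z.
Zr, Wr, Yr = resid(X, Z), resid(X, W), resid(X, Y)
b1 = tsls(Y, W, Zr)[0]
b2 = tsls(Yr, W, Zr)[0]
b3 = tsls(Yr, Wr, Zr)[0]
b4 = tsls(Y, Wr, Zr)[0]
print(b_full, b1, b2, b3, b4)
```

The five numbers should agree up to floating-point error. Using the raw Z as instruments in the short equations would break the equivalence, which is why residualizing Z is the key step.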
I think the most commonly used treatment effect estimators when treatment, D, is unconfounded conditional on X, are the following: 1. Regression adjustment. 2. Inverse probability (propensity score) weighting. 3. Augmented IPW. 4. IPWRA. 5. Covariate matching. 6. PS matching.
RA, AIPW, and IPWRA all use conditional mean functions; usually linear but can be logit, multinomial logit, exponential, and others.
I like RA because it is straightforward -- even if using logit or Poisson -- and it is easy to obtain moderating effects.
But, technically, RA requires correct specification of the conditional means E[Y(d)|X] for consistency.
IPW uses only specification of the PS. We now know we should use normalized weights. IPW can be sensitive to overlap problems because p^(X) can be close to one or zero.
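As a small illustration of the first two estimators, here is a sketch on simulated data with linear conditional means for RA and a logit propensity score for IPW with normalized weights. The data-generating process, the true ATE of 2, and the hand-rolled `fit_logit` Newton routine are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
p_true = 1.0 / (1.0 + np.exp(-0.5 * x))
d = (rng.uniform(size=N) < p_true).astype(float)   # unconfounded given x
y0 = 1.0 + x + rng.normal(size=N)
y1 = y0 + 2.0 + 0.5 * x                            # ATE = 2 because E[x] = 0
y = np.where(d == 1, y1, y0)

# Regression adjustment: fit E[Y(d)|X] separately by treatment arm,
# predict both potential outcomes for everyone, and average the difference.
b1 = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)[0]
b0 = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)[0]
ate_ra = (X @ b1 - X @ b0).mean()

def fit_logit(X, d, iters=25):
    """Logit MLE by Newton's method (illustrative helper, no convergence checks)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (d - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        b = b + np.linalg.solve(hess, grad)
    return b

# IPW with normalized weights: the weights in each arm are rescaled to sum to one.
p_hat = 1.0 / (1.0 + np.exp(-X @ fit_logit(X, d)))
w1 = d / p_hat
w0 = (1.0 - d) / (1.0 - p_hat)
ate_ipw = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()

print(round(ate_ra, 3), round(ate_ipw, 3))
```

Overlap is comfortable in this design. Pushing the propensity-score coefficient up makes p^(X) pile up near zero and one, which is an easy way to see the IPW sensitivity mentioned above.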
It's been too long since I've made a substantive tweet, so here goes. At the following Dropbox link you can access the slides and Stata files for my recent talk at the Stata UK meeting:
Perhaps even longer to figure out some tricks to make standard error calculation for aggregated, weighted effects easy. I think I've figured out several useful relationships and shortcuts. Ex post, most are not surprising. I didn't have them all in my WP or my nonlinear DiD.
The talk is only about regression-based methods, but includes logit and Poisson regression (and even other nonlinear models). In the linear case, slide 28 shows a "very long regression." I was tempted to call it something like the "grand unified regression."
Okay, here goes. T = 2 balanced panel data. D defines treated group, f2_t is the second period dummy, W_t = D*f2_t is the treatment. Y_1 and Y_2 are outcomes in the first and second period. ΔY = Y_2 - Y_1. X are time-constant controls. X_dm = X - Xbar_1 (mean of treated units).
2. Pooled OLS of Y_t on 1, W_t, W_t*X_dm, D, X, D*X, f2_t, f2_t*X; ATT is coef on W_t (t = 1,2)
3. Random effects estimation with same variables in (2).
4. FE estimation of (2), where D, X, D*X drop out.
Imputation versions of each:
5. OLS of ΔY on 1, X using D = 0 (control units). Get residuals TE^_FD for all units. Average TE^_FD over treated units.
6. POLS of Y_t on 1, D, X, D*X, f2_t, f2_t*X using W_t = 0 (control obs). Get residuals TE_t^_POLS. ATT is the average of TE_t^_POLS over W_t = 1 (treated observations).
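Here is a sketch that computes (2), (5), and (6) on one simulated data set so the numbers can be compared side by side. The data-generating process is invented for illustration, and the variable names simply follow the notation above. Because each regression is saturated in the relevant D-by-period cells and the panel is balanced, the three ATT estimates coincide in this setup up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 500
X = rng.normal(size=N)                                   # time-constant control
D = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X))).astype(float)
c = rng.normal(size=N)                                   # unit heterogeneity
Y1 = c + 0.5 * X + 0.3 * D + rng.normal(size=N)
Y2 = c + 1.0 + 0.8 * X + 0.3 * D + D * (2.0 + 0.5 * X) + rng.normal(size=N)
dY = Y2 - Y1
X_dm = X - X[D == 1].mean()                              # centered at treated mean

# Stack the two periods for the pooled regressions.
Yp = np.concatenate([Y1, Y2])
f2 = np.concatenate([np.zeros(N), np.ones(N)])
Dp, Xp, Xdm_p = np.tile(D, 2), np.tile(X, 2), np.tile(X_dm, 2)
Wp = Dp * f2
one = np.ones(2 * N)

# (2) Pooled OLS; the ATT is the coefficient on W_t.
Z2 = np.column_stack([one, Wp, Wp * Xdm_p, Dp, Xp, Dp * Xp, f2, f2 * Xp])
att_pols = np.linalg.lstsq(Z2, Yp, rcond=None)[0][1]

# (5) Imputation in first differences: fit on controls, average residuals over treated.
ctrl = D == 0
Z5 = np.column_stack([np.ones(N), X])
b5 = np.linalg.lstsq(Z5[ctrl], dY[ctrl], rcond=None)[0]
te_fd = dY - Z5 @ b5
att_fd_imp = te_fd[D == 1].mean()

# (6) Imputation with pooled OLS: fit on W_t = 0 observations, average over W_t = 1.
Z6 = np.column_stack([one, Dp, Xp, Dp * Xp, f2, f2 * Xp])
untreated = Wp == 0
b6 = np.linalg.lstsq(Z6[untreated], Yp[untreated], rcond=None)[0]
te_pols = Yp - Z6 @ b6
att_pols_imp = te_pols[Wp == 1].mean()

print(att_pols, att_fd_imp, att_pols_imp)
```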
I've been asked recently by a few people about using a control function approach along with the Poisson FE estimator with panel data. It turns out there's a simple solution if you're willing to assume a linear first stage.
Use linear FE in the first stage and obtain residuals.
Of course, you'd include time dummies.
In the second stage, insert the residuals into an exponential function that includes all variables -- endogenous and exogenous. This is the CF step. Estimate using the Poisson FE estimator.
Time dummies in second stage, too.
One generally needs to adjust standard errors. That can be done by bootstrapping both stages or setting up as a joint GMM problem. Under the null that the coeff on the CF is zero (exogeneity with respect to shocks), a usual cluster-robust t test (or Wald test) is valid.
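A rough sketch of the two stages on simulated data, for anyone who wants to see the mechanics. The data-generating process, the parameter values, and the hand-written `poisson_fe` routine (Newton iterations on the conditional multinomial form of the Poisson FE likelihood) are my own illustrative assumptions, not a library API. It returns point estimates only; the standard-error adjustment described above is not implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 2000, 5
b_true, rho = 0.5, 0.5

c = rng.normal(0.0, 0.3, size=(N, 1))        # unit effects
z = rng.normal(size=(N, T))                  # excluded instrument
x = rng.normal(size=(N, T))                  # exogenous control
v = rng.normal(0.0, 0.5, size=(N, T))        # first-stage shock (source of endogeneity)
w = c + 1.0 * z + 0.5 * x + v                # endogenous explanatory variable
y = rng.poisson(np.exp(c + b_true * w + 0.3 * x + rho * v))

time_dummies = np.broadcast_to(np.eye(T)[:, 1:], (N, T, T - 1))

def within(a):
    """Subtract unit-specific time averages."""
    return a - a.mean(axis=1, keepdims=True)

# Stage 1: linear FE of w on z, x, and time dummies; keep the FE residuals.
R1 = within(np.concatenate([z[..., None], x[..., None], time_dummies], axis=2))
R1 = R1.reshape(N * T, -1)
w_dm = within(w).reshape(N * T)
pi_hat = np.linalg.lstsq(R1, w_dm, rcond=None)[0]
v_hat = (w_dm - R1 @ pi_hat).reshape(N, T)

def poisson_fe(y, X, iters=50):
    """Poisson FE via Newton on the conditional (multinomial) log likelihood."""
    n = y.sum(axis=1)                                    # unit totals
    b = np.zeros(X.shape[2])
    for _ in range(iters):
        idx = X @ b
        idx = idx - idx.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(idx)
        p = p / p.sum(axis=1, keepdims=True)             # within-unit shares
        score = np.einsum("it,itk->k", y - n[:, None] * p, X)
        Xc = X - np.einsum("it,itk->ik", p, X)[:, None, :]
        neg_hess = np.einsum("i,it,itk,itl->kl", n, p, Xc, Xc)
        step = np.linalg.solve(neg_hess, score)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

# Stage 2 (CF step): Poisson FE with w, x, the first-stage residuals, and time dummies.
X_cf = np.concatenate([w[..., None], x[..., None], v_hat[..., None], time_dummies], axis=2)
b_cf = poisson_fe(y, X_cf)

# For comparison: Poisson FE that ignores the endogeneity of w.
X_naive = np.concatenate([w[..., None], x[..., None], time_dummies], axis=2)
b_naive = poisson_fe(y, X_naive)

print("CF estimate on w:", round(b_cf[0], 3), " naive:", round(b_naive[0], 3))
print("coefficient on CF residual:", round(b_cf[2], 3))
```

In this design the unit average of the first-stage error is absorbed by the unit effect, which is why plugging in FE residuals is enough.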
Thanks for doing this, Jon. I've been thinking about this quite a bit, and teaching my perspective. I should spend less time teaching, more time revising a certain paper. Here's my take, which I think overlaps a lot with yours.
I never thought of BJS as trying to do a typical event study. As I showed in my TWFE-TWMundlak paper, without covariates, BJS is the same as what I called extended TWFE. ETWFE puts in only treatment dummies of the form Dg*fs, s >= g, where Dg is cohort, fs is calendar time.
ETWFE is derivable from POLS using cohort dummies, which derives directly from imposing and using all implications of parallel trends. That's why it's relatively efficient under the traditional assumptions. To me, this is the starting point.
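The no-covariate equivalence can be checked numerically. Below is a sketch on a small simulated balanced panel: ETWFE with a dummy Dg*fs for every treated cohort-by-period cell, versus an imputation estimator that fits unit and time effects on untreated observations only and then averages the imputed effects within each cell. The design (four cohorts of 15 units, five periods, the effect sizes) is made up, and the imputation step is my simplified reading of the BJS procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5
cohorts = np.repeat([0, 3, 4, 5], 15)        # 0 = never treated, else first treated period
N = cohorts.size
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(1, T + 1), N)
g = cohorts[unit]
treated = (g > 0) & (time >= g)

y = (rng.normal(size=N)[unit]                # unit effects
     + 0.3 * time                            # common time path
     + treated * (1.0 + 0.5 * (time - g))    # heterogeneous, dynamic effects
     + rng.normal(size=N * T))

unit_d = (unit[:, None] == np.arange(N)).astype(float)
time_d = (time[:, None] == np.arange(2, T + 1)).astype(float)
cells = [(c, s) for c in (3, 4, 5) for s in range(c, T + 1)]
cell_d = np.column_stack([(g == c) & (time == s) for c, s in cells]).astype(float)

# ETWFE: unit dummies, time dummies, and one dummy per treated (cohort, period) cell.
X_etwfe = np.hstack([unit_d, time_d, cell_d])
etwfe = np.linalg.lstsq(X_etwfe, y, rcond=None)[0][-len(cells):]

# Imputation: estimate unit and time effects using untreated observations only,
# impute Y(0) for treated observations, and average Y - Y(0)-hat within each cell.
X0 = np.hstack([unit_d, time_d])
coef0 = np.linalg.lstsq(X0[~treated], y[~treated], rcond=None)[0]
tau_hat = y - X0 @ coef0
imputation = np.array([tau_hat[(g == c) & (time == s)].mean() for c, s in cells])

for (c, s), a, b in zip(cells, etwfe, imputation):
    print(f"cohort {c}, period {s}: ETWFE {a:.4f}  imputation {b:.4f}")
```

The two columns should match cell by cell, up to floating-point error.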