I finally got my TWFE/Mundlak/DID paper in good enough shape to make it an official working paper. I'll put it in other places but it's currently here:
I changed the title a bit to better reflect its contents. I'm really happy with the results, less happy that the paper got a bit unwieldy. It's intended to be a "low hanging fruit" DID paper.
Now I've more formally shown that the estimator I was proposing -- pooled OLS, TWFE, or RE (they're all the same, properly done) -- identifies every dynamic treatment effect one is interested in (on means) in a staggered design.
I formalized the no anticipation and common trends assumptions to make them simple and comparable to the literature. The method involves lots of regressors but it's linear. Plus, it is both BLUE and asymptotically efficient under a standard no serial correlation/homoskedasticity assumption.
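In shorthand (my notation here, not necessarily the paper's exact statement): for a cohort first treated at time g, with y_t(inf) the never-treated potential outcome and x the covariates, no anticipation says E[y_t(g) | cohort g, x] = E[y_t(inf) | cohort g, x] for all t < g, and conditional parallel trends says E[y_t(inf) - y_(t-1)(inf) | cohort g, x] is the same across cohorts g, including the never treated.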
Combining effects and testing restrictions on them is as simple as the Stata "test" and "lincom" commands. If you buy NA and conditional parallel trends, linearity of the conditional means in the covariates is the only shortcoming.
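To make the "lots of regressors but linear" point concrete, here is a minimal Stata sketch of the pooled OLS / extended TWFE regression with full cohort-by-calendar-time effect heterogeneity. Variable names (y, w, id, year, cohort = first treatment year, with cohort = 0 for never treated) are hypothetical, and I'm omitting the covariate interactions the paper allows:

    * Cohort and year dummies plus the treatment dummy interacted with cohort and year;
    * under NA/PT, the nonzero interactions give one ATT per treated (cohort, year) cell.
    reg y i.cohort i.year i.cohort#i.year#c.w, vce(cluster id)

    * Combining and testing the cell ATTs is then routine, e.g. (illustrative cohort/year values):
    lincom 2015.cohort#2016.year#c.w - 2015.cohort#2015.year#c.w
    test 2015.cohort#2015.year#c.w = 2016.cohort#2016.year#c.w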
Unlike previous work, I show what is being identified by the POLS/ETWFE estimator when there is no never-treated group. It's quite neat under a natural extension of the PT assumption. In the final period, you get the ATTs of being treated before the final period.
In other approaches, such as Callaway and Sant'Anna, no effects are estimated in the final period. Using POLS/ETWFE, only one parameter is lost: there is no control group for the cohort treated in the final period.
In preparing a DID course, I discovered some other neat stuff. When you impose PT and allow flexible interactions with covariates the way I do, the POLS estimator is numerically identical to the comparable imputation estimator in Borusyak, Jaravel, and Spiess (2021), "Revisiting Event Study Designs: Robust and Efficient Estimation." @borusyak
I've only partly proven this algebraically. Some may remember my reference to a "proof by Stata" a couple of weeks ago .... POLS makes it trivial to obtain valid inference when N is large, T not too large.
Since BJS showed efficiency of their estimator (albeit under different assumptions), it's ex post not too surprising the methods coincide in the PT case. Could we have two BLUEs?
The POLS approach extends naturally to nonlinear models, but I'm writing a separate paper on that.
Another important point is that, once we know POLS uses the NA/PT assumptions efficiently, we can look for better ways to do doubly robust estimation. Methods that use long differences are inefficient.
Love the Callaway/Sant'Anna work, but long differences throw out useful information, especially in later periods when there are many potential controls or with many pre-treatment periods. @pedrohcgs
POLS uses all valid control units for each cohort/year combo.
A first simulation shows Callaway/Sant'Anna SDs 30% larger than POLS. The bias of each is negligible. Yes, it was set up with linear conditional means. And I used csdid in Stata.
There's more: I show how PT can be easily tested while allowing for flexible covariates.
Simple robust Wald tests using the Stata "test" command. It does not get around the pre-testing problem (@jondr44) but it makes it transparent: it's the age-old problem of pre-testing a set of regressors.
If the controls are supposed to induce PT, condition on them in testing.
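Here's a minimal sketch of what such a test can look like in Stata (my illustration, not necessarily the paper's exact specification). Suppose pre = 1 for an eventually treated unit in a pre-treatment year other than the year just before its treatment (that year serves as the reference), and x is a control meant to induce PT:

    * Condition on the controls (interacted flexibly) and add cohort-by-year "placebo"
    * terms for the pre-treatment cells; under NA and conditional PT they are jointly zero.
    reg y i.cohort##c.x i.year##c.x i.cohort#i.year#c.w i.cohort#i.year#c.pre, vce(cluster id)
    testparm i.cohort#i.year#c.pre    // robust Wald test of the pre-trend terms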
As I suspected would happen, things have come full circle for me. Take the disgraced TWFE estimator, add full heterogeneous effects, allow heterogeneity in covariates across cohort, time, and treatment. Combine and test as you wish. It's incredibly simple, flexible; it can be efficient.
With decent overlap I'm not too worried about the linear conditional expectation. Don't mess with OLS.
Oh, wait, I am worried about linearity -- that's why I'm working on nonlinear extensions (binary, fractional, nonnegative).
For my German friends: What is the German equivalent of "Ms." when addressing a woman (not yet a Dr.)? I noticed on a course application form in English -- I assume translated from German -- only two choices, "Mr." and "Mrs." Is "Frau" used for both Mrs. and Ms.?
As a follow-up: If I use English, I assume "Ms." is acceptable. I never address anyone as "Mrs." in English. It's interesting that "Frau" was translated as "Mrs." rather than "Ms." I would've expected the latter, especially in an academic setting.
My formal German courses were in the 1970s, and I learned that "Frau" is for married women only. I think I can make the adjustment, though. 🤓
I'm still intrigued that there is no "Ms." equivalent in German ....
Here's a panel DID question. Common intervention at t=T0. Multiple pre-treatment and post-treatment periods. Dummy d(i) is one if a unit is eventually treated. p(t) is one for t >= T0. Treatment indicator is w(i,t) = d(i)*p(t). Time constant controls are x(i).
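A minimal sketch of that setup in Stata, with hypothetical variable names (id, t, y, d, x) and T0 = 10 chosen purely for illustration:

    gen p = (t >= 10)                     // post-intervention indicator, p(t) = 1 for t >= T0
    gen w = d*p                           // treatment indicator w(i,t) = d(i)*p(t)
    reg y i.d i.t c.w, vce(cluster id)    // one basic TWFE-style regression with these objects (covariate interactions omitted)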
I should admit that my tweets and poll about missing data were partly self-serving, as I'm interested in what people do. But it was a mistake to leave the poll initially vague. I haven't said much useful on Twitter in some time, so I'll try here.
I want to start with the very simple case where there is one x and I'm interested in E(y|x); assume it's linear (for now). Data are missing on x but not on y. Here are some observations.
1. If the data are missing as a function of x -- formally, E(y|x,m) = E(y|x) -- the CC (complete cases) estimator is consistent (even conditionally unbiased). 2. Imputing on the basis of y is not, and can be badly biased. 3. Inverse probability weighting using 1/P(m=0|y) is also inconsistent.
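A minimal Stata simulation sketch of point 1 (the parameter values and missingness rule are made up): when x is missing as a function of x itself, the complete-case slope stays close to the truth.

    clear
    set seed 123
    set obs 10000
    gen x = rnormal()
    gen y = 1 + 2*x + rnormal()     // true slope is 2
    gen m = (x > 1)                 // x missing exactly when x > 1: missingness depends on x only
    replace x = . if m == 1
    reg y x                         // complete-case OLS; the slope estimate should be near 2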
Several comments on this paper. First, it's nice to see someone taking the units of measurement issue seriously. But I still see many issues, especially when y >= 0 and we have better alternatives.
1. A search is required over units of measurement.
How do I compute a legitimate standard error of, say, an elasticity? I've estimated theta but then I ignore the fact that I estimated it? That's not allowed.
2. As with many transformation models, the premise is that there exists a transformation g(.) such that g(y) = xb + u.
u is assumed to be independent of x, at a minimum. Often the distribution is restricted. In 1989, in an IER paper, I argued this was a real problem with Box-Cox approaches because u >= -xb. If I model E(y|x) directly I need none of that. It's what Poisson regression does.
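For nonnegative y, modeling E(y|x) directly as exp(xb) via Poisson quasi-MLE sidesteps the transformation entirely; a minimal Stata sketch (variable names are placeholders):

    * Poisson regression used as a quasi-MLE for E(y|x) = exp(xb); only the conditional
    * mean needs to be correctly specified, so use fully robust standard errors.
    poisson y x1 x2, vce(robust)
    * Coefficients on regressors in logs are elasticities of E(y|x); on levels, semi-elasticities.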
A year ago on Facebook, at the request of a former MSU student, I made this post. I used to say in class that econometrics is not so hard if you just master about 10 tools and apply them again and again. I decided I should put up or shut up.
I cheated by combining tools that are connected, so there are actually more than 10 .... 1. Law of Iterated Expectations, Law of Total Variance 2. Linearity of Expectations, Variance of a Sum 3. Jensen's Inequality, Chebyshev’s Inequality 4. Linear Projection and Its Properties
5. Weak Law of Large Numbers, Central Limit Theorem 6. Slutsky's Theorem, Continuous Convergence Theorem, Asymptotic Equivalence Lemma 7. Big Op, Little op, and the algebra of them.
Have we yet figured out when we should include a lagged dependent variable in either time series or panel data models when the goal is to infer causality? (For forecasting, the issue is clear.) Recent work in macroeconomics on causal effects is a positive sign.
And the answer cannot be, "Include y(t-1) if it is statistically significant." Being clear about potential outcomes and the nature of the causal effects we hope to estimate is crucial. I need to catch up on this literature and I need to think more.
In panel data settings, if our main goal is to distinguish state dependence from heterogeneity, clearly y(t-1) gets included. But what if our interest is in a policy variable? Should we hold fixed y(t-1) and the heterogeneity when measuring the policy effect?