Because of a recent post at Data Colada, I've been asked for my take on the various heteroskedasticity-robust standard errors. In the taxonomy of MacKinnon-White and Davidson-MacKinnon, there are HC0, HC1, HC2, and HC3.
HC0 was the original variance matrix estimator proposed in White (1980, Econometrica). HC1 = [n/(n-k)]*HC0 makes a simple degrees-of-freedom adjustment. Clearly HC1 - HC0 = [k/(n-k)]*HC0 is positive semi-definite (even PD when HC0 is).
HC2 divides the squared residuals, uhat(i)^2, by 1 - h(i,i), where the h(i,i) are the diagonal elements of the "hat" (projection) matrix. Under homoskedasticity, E[uhat(i)^2] = sigma^2*[1 - h(i,i)], so each uhat(i)^2/[1 - h(i,i)] is unbiased for sigma^2 -- n different unbiased estimators.
HC3 uses uhat(i)^2/[1 - h(i,i)]^2 and is closely related to the jackknife estimator. Because 0 < h(i,i) < 1, it is easy to see that HC3 - HC2 is PSD.
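For reference, all four are sandwich estimators that differ only in how the squared residuals are weighted:

```latex
\widehat{\mathrm{Avar}}(\hat\beta)
  = (X'X)^{-1}\Bigl(\textstyle\sum_{i=1}^{n} a_i\,\hat u_i^2\, x_i x_i'\Bigr)(X'X)^{-1},
\qquad
a_i =
\begin{cases}
1, & \text{HC0}\\
n/(n-k), & \text{HC1}\\
1/(1-h_{ii}), & \text{HC2}\\
1/(1-h_{ii})^2, & \text{HC3.}
\end{cases}
```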
All four estimators, when properly scaled by the sample size, are consistent for Avar[sqrt(n)*(betahat - beta)].
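A quick way to see all four on a single simulated sample in Stata (the DGP and variable names here are purely for illustration):

```stata
* Illustrative DGP with heteroskedastic errors, n = 200.
clear
set seed 123
set obs 200
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + x1 + x2 + rnormal()*(1 + abs(x1))

regress y x1 x2, vce(robust)          // HC1: Stata's default "robust"
scalar se1 = _se[x1]
scalar se0 = se1*sqrt(e(df_r)/e(N))   // HC0: undo the n/(n-k) df adjustment
regress y x1 x2, vce(hc2)             // HC2: uhat(i)^2/[1 - h(i,i)]
scalar se2 = _se[x1]
regress y x1 x2, vce(hc3)             // HC3: uhat(i)^2/[1 - h(i,i)]^2
scalar se3 = _se[x1]
display "se(x1):  HC0 = " se0 "  HC1 = " se1 "  HC2 = " se2 "  HC3 = " se3
```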
Simulation studies show that HC2 and HC3 lead to better -- with small n, possibly much better -- confidence intervals than HC1.
The Data Colada analysis shows that the HC1 [the Stata default with vce(robust)] and HC3 standard errors can be very different when the sample size is not large.
The QJE study cited used HC1 when comparing with randomization inference in experiments. But HC3 turns out to work quite well even with pretty small n. It seems like a good idea to use HC2 or HC3. It won't make much difference if n is large.
In thinking about this, it is natural to wonder whether HC2 - HC1 is PSD. To me, it's not obvious from the formulas. If it's true, then in terms of standard errors we would always have
HC0 <= HC1 <= HC2 <= HC3
As I said, the first and third inequalities are easy to show.
I can't stand not knowing whether the middle inequality is true. So I ran a simulation with n = 200, k = 3, and 1,000,000 replications -- 3 million standard errors in all. And every time, HC1 <= HC2. Another proof by Stata!
I ran some other simulations, too. Out of about 5 million chances, HC1 > HC2 never occurred.
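For anyone who wants to replicate the exercise, here's a scaled-down sketch (1,000 reps instead of 1,000,000, and the DGP is again just illustrative):

```stata
* Fraction of replications with HC1 > HC2 for the coefficient on x1.
capture program drop hcsim
program define hcsim, rclass
    clear
    set obs 200
    generate x1 = rnormal()
    generate x2 = rchi2(1)
    generate y  = 1 + x1 - x2 + rnormal()*(1 + .5*abs(x1))
    regress y x1 x2, vce(robust)      // HC1
    scalar se1 = _se[x1]
    regress y x1 x2, vce(hc2)         // HC2
    return scalar viol = (se1 > _se[x1])
end
simulate viol = r(viol), reps(1000) seed(42) nodots: hcsim
summarize viol                        // mean = fraction with HC1 > HC2
```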
So, hotshots in linear algebra: Please provide a proof that HC2 - HC1 is PSD! I suspect a proof is out there, but I've checked the obvious econometrics sources without finding one.
Oh, and I'm willing to share co-authorship of the Economics Letters paper. 😬🤓
Here's the summary of what I know:
HC0 <= HC1
HC0 <= HC2
HC2 <= HC3
Ambiguous: HC1 vs HC2 and HC1 vs HC3.
With k/n small, HC1 tends to be smaller than HC2.
In my shared Dropbox folder I added a folder, het_robust, with a Stata program that simulates data, computes HC1, HC2, and HC3, and reports the fraction of times HC1 > HC2, HC1 > HC3.
In my shared Dropbox folder, pinned at the top, I posted the latest version of my TWFE/TWMundlak paper. It's essentially complete (and too long ...). I've included the "truly marvelous" proof of equivalence between pooled OLS and imputation.
I also fixed some of the material on testing/correcting for heterogeneous trends. A nice result is that the POLS approach with cohort-specific trends is the same as the obvious imputation approach.
This means that using the full regression to correct for non-parallel trends suffers no contamination when testing: it's identical to using only untreated observations to test for pre-trends. But one must allow full heterogeneity in the cohort/time ATTs for the equivalence to hold.
Fortunately, the speculations I made in my linear DiD paper about extensions to the nonlinear case turn out to be true -- with a small caveat: one should use the canonical link function for the chosen quasi-log-likelihood (QLL).
So, exponential mean/Poisson QLL if y >= 0.
Logistic mean/Bernoulli QLL if 0 <= y <= 1 (binary or fractional). (We call this logit and fractional logit.)
Linear mean, normal (OLS, of course).
These choices ensure that pooled estimation and imputation are numerically identical.
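In Stata terms, the three pairings are (schematic only -- y, x1, x2 are placeholders, and the actual DiD regressions carry the full set of cohort/time interactions):

```stata
poisson y x1 x2, vce(robust)                           // exponential mean, Poisson QLL (y >= 0)
glm y x1 x2, family(binomial) link(logit) vce(robust)  // logistic mean, Bernoulli QLL (0 <= y <= 1)
regress y x1 x2, vce(robust)                           // linear mean, normal QLL (OLS)
```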
It's not a coincidence that these same combos show up in my work on doubly robust estimation of treatment effects and improving efficiency without sacrificing consistency in RCTs. Latest on the latter is here:
I finally got my TWFE/Mundlak/DID paper in good enough shape to make it an official working paper. I'll put it in other places but it's currently here:
I changed the title a bit to better reflect its contents. I'm really happy with the results, less happy that the paper got a bit unwieldy. It's intended to be a "low hanging fruit" DID paper.
Now I've more formally shown that the estimator I was proposing -- pooled OLS, TWFE, or RE (they're all the same, properly done) -- identifies every dynamic treatment effect one is interested in (on means) in a staggered design.
For my German friends: What is the German equivalent of "Ms." when addressing a woman (not yet a Dr.)? I noticed on a course application form in English -- I assume translated from German -- only two choices, "Mr." and "Mrs." Is "Frau" used for both Mrs. and Ms.?
As a follow-up: If I use English, I assume "Ms." is acceptable. I never address anyone as "Mrs." in English. It's interesting that "Frau" was translated as "Mrs." rather than "Ms." I would've expected the latter, especially in an academic setting.
My formal German courses were in the 1970s, and I learned that "Frau" is for married women only. I think I can make the adjustment, though. 🤓
I'm still intrigued that there is no "Ms." equivalent in German ....
Here's a panel DID question. Common intervention at t = T0. Multiple pre-treatment and post-treatment periods. Dummy d(i) is one if a unit is eventually treated. p(t) is one for t >= T0. The treatment indicator is w(i,t) = d(i)*p(t). Time-constant controls are x(i).
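In Stata, the setup looks like this (schematic; d and t are assumed to exist, and I'm pretending the intervention period is T0 = 10):

```stata
* p(t): post-period dummy; w(i,t): treatment indicator.
generate byte p = (t >= 10)     // p(t) = 1 for t >= T0
generate byte w = d*p           // w(i,t) = d(i)*p(t)
```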
I should admit that my tweets and poll about missing data were partly self-serving, as I'm interested in what people do. But it was a mistake to leave the poll so vague at first. I haven't said much useful on Twitter in some time, so I'll try here.
I want to start with the very simple case where there is one x and I'm interested in E(y|x); assume it's linear (for now). Data are missing on x but not on y. Here are some observations.
1. If the data are missing as a function of x -- formally, E(y|x,m) = E(y|x) -- the complete-cases (CC) estimator is consistent (even conditionally unbiased).
2. Imputing on the basis of y is not, and it can be badly biased.
3. Inverse probability weighting using 1/P(m=0|y) is also inconsistent.
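A sketch of point 1 versus point 2 (made-up DGP; the true slope is 1, and missingness in x depends only on x):

```stata
clear
set seed 99
set obs 10000
generate x = rnormal()
generate y = 2 + x + rnormal()
generate byte m = (runiform() < invlogit(x))  // m flags x as missing; prob. depends on x only

regress y x if m == 0          // 1. complete cases: slope consistent for 1

regress x y if m == 0          // 2. impute x from y in the missing rows ...
predict xhat
replace xhat = x if m == 0     //    keep the true x where it is observed
regress y xhat                 //    ... slope on the imputed x is badly biased
```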