What are the relative merits of the usual 2SLS, with a linear first stage for w, versus using a probit for w and then using the probit fitted values as IVs in the second stage?
Both are consistent under standard identification assumptions. Using a probit first stage could be more efficient. The probit fitted values are the optimal IVs if (1) Var(u|x,z) is constant and (2) P(w = 1|x,z) follows a probit model. IV with probit fitted values is consistent without either assumption, just like 2SLS.
As shown by my former student Ruonan Xu, the probit first stage can help with a weak IV problem.
The probit fitted values should be IVs, not regressors. And robust standard errors should be used, as always.
If the slope on w is heterogeneous, 2SLS and IV with probit fitted values identify different "weighted treatment effects." I'd try to be less vague but I'd soon be out of my depth.
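To make the comparison concrete, here is a minimal simulation sketch, not a definitive implementation. Everything in it is an assumption made for illustration: the data-generating process, the coefficient values, and the helper names fit_probit and iv_robust. It uses only NumPy, fits the probit by Fisher scoring, and then uses the probit fitted values as the instrument for w (never as a regressor), with heteroskedasticity-robust standard errors for both estimators.

```python
import math
import numpy as np

def norm_cdf(v):
    # Standard normal CDF, built from math.erf so only NumPy is required.
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

def norm_pdf(v):
    return np.exp(-0.5 * v**2) / math.sqrt(2.0 * math.pi)

def fit_probit(w, Z, iters=50):
    """Probit MLE by Fisher scoring; Z must include a constant column."""
    g = np.zeros(Z.shape[1])
    for _ in range(iters):
        idx = Z @ g
        p = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)
        f = norm_pdf(idx)
        score = Z.T @ (f * (w - p) / (p * (1 - p)))
        info = (Z * (f**2 / (p * (1 - p)))[:, None]).T @ Z
        step = np.linalg.solve(info, score)
        g = g + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return g

def iv_robust(y, X, Zmat):
    """Just-identified IV, beta = (Z'X)^{-1} Z'y, with robust (HC0) SEs."""
    A = np.linalg.inv(Zmat.T @ X)
    beta = A @ (Zmat.T @ y)
    u = y - X @ beta
    meat = (Zmat * (u**2)[:, None]).T @ Zmat
    V = A @ meat @ A.T
    return beta, np.sqrt(np.diag(V))

# Hypothetical DGP: w is endogenous because u is correlated with e.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)
e = rng.normal(size=n)
u = 0.5 * e + rng.normal(size=n)
w = (0.3 + 0.5 * x + 1.0 * z + e > 0).astype(float)
y = 1.0 + 2.0 * w + 0.5 * x + u          # true slope on w is 2

const = np.ones(n)
X = np.column_stack([const, w, x])        # structural regressors
Z_lin = np.column_stack([const, z, x])    # usual 2SLS (linear first stage)

first_stage = np.column_stack([const, x, z])
phat = norm_cdf(first_stage @ fit_probit(w, first_stage))
Z_prb = np.column_stack([const, phat, x]) # probit fitted values as the IV

b_2sls, se_2sls = iv_robust(y, X, Z_lin)
b_prb, se_prb = iv_robust(y, X, Z_prb)
print("2SLS slope on w:     ", b_2sls[1], "robust SE:", se_2sls[1])
print("Probit-IV slope on w:", b_prb[1], "robust SE:", se_prb[1])
```

In this design both conditions for optimality hold by construction (constant Var(u|x,z) and a probit response probability), so it is a favorable case for the probit first stage. With a single instrument z, 2SLS is the same as IV using z directly, which is why one routine handles both estimators.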
A bit more on clustering. If you observe the entire population and assignment is at the unit level, there is no need to cluster. If assignment is at the group level -- to all units -- cluster at the group level. (Hopefully there are many groups.)
I've used the term "ex-post clustering" to describe obsession with clustering just to do it. You don't cluster individual data at the county, state, or regional level just for the heck of it. One must take a stand on the sampling and assignment schemes.
It's easy to see with formulas for estimating the mean from a population. The clustered standard error is too large because it mistakes heterogeneity in the means across groups for cluster correlation.
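A small simulation can illustrate the point about estimating a mean. This is only a sketch under assumed conditions: units are drawn by random sampling (not cluster sampling) from a population with 20 groups whose means differ, and the function name mean_ses and all numbers are hypothetical. It compares the usual standard error of the sample mean, the clustered one, and the actual Monte Carlo standard deviation of the sample mean.

```python
import numpy as np

def mean_ses(y, g):
    """Usual and cluster-robust standard errors for the sample mean of y."""
    n = y.size
    resid = y - y.mean()
    se_usual = y.std(ddof=1) / np.sqrt(n)
    # Cluster-robust: sum residuals within each group, then square and add.
    group_sums = np.bincount(g, weights=resid)
    se_cluster = np.sqrt(np.sum(group_sums**2)) / n
    return se_usual, se_cluster

rng = np.random.default_rng(1)
G, n, reps = 20, 2000, 500
mu_g = rng.normal(scale=1.0, size=G)   # fixed, heterogeneous group means

means, usual, clustered = [], [], []
for _ in range(reps):
    g = rng.integers(0, G, size=n)     # random sampling of units
    y = mu_g[g] + rng.normal(size=n)
    su, sc = mean_ses(y, g)
    means.append(y.mean())
    usual.append(su)
    clustered.append(sc)

mc_sd = np.std(means, ddof=1)          # true sampling variability
avg_usual = np.mean(usual)
avg_clustered = np.mean(clustered)
print("Monte Carlo SD of the mean:", mc_sd)
print("Average usual SE:          ", avg_usual)
print("Average clustered SE:      ", avg_clustered)
```

Under these assumptions the usual standard error should track the Monte Carlo standard deviation, while the clustered one is several times larger because the group sums of residuals are dominated by differences in group means.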
I've become a believer in always reporting "robust" standard errors. This may seem obvious, but there are nuances. And I'm not talking indiscriminate clustering -- I'll comment on that at some point. Let's start with random sampling from a cross section.
We've discussed quasi-MLE, such as fractional logit and Poisson regression. If E(y|x) is correct, we want standard errors robust to general variance misspecification.
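As an illustration of what "robust to general variance misspecification" means for a quasi-MLE, here is a minimal sketch for Poisson regression with an exponential mean. The data-generating process, the function name poisson_qmle, and the starting values are assumptions for the example. The outcome is nonnegative and continuous with Var(y|x) far from the Poisson variance, but E(y|x) is correctly specified, so the coefficients are still estimated consistently and only the standard errors need the sandwich form.

```python
import numpy as np

def poisson_qmle(y, X, iters=100):
    """Poisson quasi-MLE with exponential mean; first column of X is a constant."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())                 # start the intercept at log(ybar)
    for _ in range(iters):
        mu = np.exp(X @ b)
        score = X.T @ (y - mu)
        hess = (X * mu[:, None]).T @ X
        step = np.linalg.solve(hess, score)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    mu = np.exp(X @ b)
    A = (X * mu[:, None]).T @ X              # minus the Hessian
    B = (X * ((y - mu) ** 2)[:, None]).T @ X # outer product of the scores
    A_inv = np.linalg.inv(A)
    se_mle = np.sqrt(np.diag(A_inv))                 # valid only if Var(y|x) = E(y|x)
    se_robust = np.sqrt(np.diag(A_inv @ B @ A_inv))  # sandwich form
    return b, se_mle, se_robust

# Hypothetical DGP: correct conditional mean, badly non-Poisson variance.
rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
mean_y = np.exp(0.5 + 0.5 * x)
y = mean_y * rng.gamma(shape=0.5, scale=2.0, size=n)  # multiplicative noise, mean 1

X = np.column_stack([np.ones(n), x])
b, se_mle, se_robust = poisson_qmle(y, X)
print("coefficients:", b)
print("MLE SEs:     ", se_mle)
print("robust SEs:  ", se_robust)
```

Here the non-robust standard errors rely on the Poisson variance assumption, which the simulated data violate, so they should understate the sampling uncertainty relative to the sandwich version.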
Based on questions I get, it seems there's confusion about choosing between RE and FE in panel data applications. I'm afraid I've contributed. The impression seems to be that if RE "passes" a suitable Hausman test then it should be used. This is false.
I'm trying to emphasize in my teaching that using RE (unless CRE = FE) is an act of desperation. If the FE estimates and the clustered standard errors are "good" (intentionally vague), there's no need to consider RE.
RE is considered when the FE estimates are too imprecise to do much with. With good controls -- say, industry dummies in a firm-level equation -- one might get by with RE. And then choosing between RE and FE makes some sense.
So we agree that, provided y is the variable of interest -- not censored -- a linear model estimated by OLS is a good starting point. But other functional forms can be better, such as logistic if y is binary or fractional, exponential if y is nonnegative.
In many cases one should include the covariates flexibly -- such as squares and interactions. This is especially true in treatment effect contexts. If w is the treatment, interact it with the controls when estimating the average treatment effect.
As @TymonSloczynski showed in his elegant 2020 REStat paper, if d is the treatment, just adding x as in the regression y on d, x can produce a badly biased estimate of the ATE. Interacting d and elements of x is generally better. Same is true for nonlinear regression adjustment.
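A quick simulated sketch of the difference, with all numbers and variable names chosen for illustration only: treatment d is more likely when x is large, the treatment effect also rises with x, and roughly a quarter of units are treated. The true ATE is 1 by construction. The first regression just adds x; the second interacts d with the demeaned x, so that the coefficient on d estimates the ATE.

```python
import numpy as np

def ols(y, X):
    # OLS coefficients via least squares.
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical DGP: unconfounded given x, heterogeneous effect 1 + 2x, ATE = 1.
rng = np.random.default_rng(3)
n = 20000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-1.5 + 1.5 * x)))   # propensity rises with x
d = (rng.random(n) < p).astype(float)
y = x + d * (1.0 + 2.0 * x) + rng.normal(size=n)

const = np.ones(n)

# Regression of y on d, x: the coefficient on d is not the ATE in general.
ate_simple = ols(y, np.column_stack([const, d, x]))[1]

# Regression of y on d, x, d*(x - xbar): the coefficient on d estimates the ATE.
x_dm = x - x.mean()
ate_interacted = ols(y, np.column_stack([const, d, x, d * x_dm]))[1]

print("share treated:         ", d.mean())
print("y on d, x:             ", ate_simple)
print("y on d, x, d*(x - xbar):", ate_interacted)
```

Demeaning x before interacting is what makes the coefficient on d interpretable as the average effect; without centering, that coefficient would be the effect at x = 0.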