I wish as a profession we would be more careful about tossing around terms like "endogeneity" -- especially with panel data. For many years, I've been emphasizing that the error consists of two components; I call them c(i) and u(i,t). I always include time dummies, say, f(t). Endogeneity WRT c(i) and f(t) is handled by TWFE. But that leaves u(i,t), the idiosyncratic, time-varying shocks. For that, we generally need IV along with TWFE.

In terms of DiD, the assignment can be correlated with the level in the control state, y_it(0) -- so it can be endog.
Nice stuff! Pedro knows I'm competitive, and now he's thrown down the gauntlet. I'll have to clean up my shared Dropbox (see pinned tweet). For starters, I finally have a new version of my extended TWFE paper -- posted there. It's shorter and hopefully more to the point. Includes a bunch of equivalences that I've discovered over the past few years -- some recent. And I show that the regression-based "event study" approaches of Sun-Abraham/Callaway-Sant'Anna are the same when S-A includes covariates fully flexibly as with my ETWFE method.
There's a good reason the Frisch-Waugh-Lovell Theorem is taught in intro econometrics, at least at the graduate level. It's used to characterize omitted variable bias as well as the plim of OLS estimators under treatment heterogeneity and also diff-in-diffs. And more. I also teach the 2SLS version of FWL, where exogenous variables, X, are partialled out of the IVs, Z, with endogenous explan vars W. It's important to emphasize that the IV needs to be residualized with respect to X. Let Z" be those residuals. This is the key partialling out.
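A minimal numerical sketch of both versions of FWL, on simulated data. The data-generating process, the coefficient values, and the helper names (`ols`, `residualize`) are illustrative assumptions. The point is only that the coefficient on W from the full regression (or the full IV estimation) matches the one obtained after partialling X out.

```python
import numpy as np

def ols(A, y):
    """Least-squares coefficients from regressing y on the columns of A."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

def residualize(v, X):
    """Residuals from regressing v on X, i.e. X partialled out of v."""
    return v - X @ ols(X, v)

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # exogenous controls, incl. constant
z = 0.5 * X[:, 1] + rng.normal(size=n)                 # instrument, correlated with X
v = rng.normal(size=n)
w = 1.0 + 0.8 * z + 0.3 * X[:, 1] + v                  # endogenous explanatory variable
y = 2.0 + 1.5 * w - 1.0 * X[:, 1] + 0.7 * v + rng.normal(size=n)

# OLS version of FWL: coefficient on w from the long regression
# equals the simple regression of y on the residualized w.
b_full = ols(np.column_stack([w, X]), y)[0]
w_res = residualize(w, X)
b_fwl = (w_res @ y) / (w_res @ w_res)

# 2SLS version (just identified): instruments (z, X) for regressors (w, X).
Zmat = np.column_stack([z, X])
Wmat = np.column_stack([w, X])
iv_full = np.linalg.solve(Zmat.T @ Wmat, Zmat.T @ y)[0]

# Same coefficient using only the instrument residualized with respect to X.
z_res = residualize(z, X)
iv_fwl = (z_res @ y) / (z_res @ w)
```

In the IV case only the instrument needs to be residualized: because z_res is orthogonal to X, it makes no difference whether y and w are also residualized.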
I think the most commonly used treatment effect estimators when treatment, D, is unconfounded conditional on X, are the following:
1. Regression adjustment.
2. Inverse probability (propensity score) weighting.
3. Augmented IPW.
4. IPW regression adjustment (IPWRA).
5. Covariate matching.
6. PS matching.

RA, AIPW, and IPWRA all use conditional mean functions; usually linear but can be logit, multinomial logit, exponential, and others.

I like RA because it is straightforward -- even if using logit or Poisson -- and it is easy to obtain moderating effects.
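Here is a small sketch of linear regression adjustment on simulated data; the data-generating process and variable names are assumptions made up for illustration. It computes the RA estimate of the ATE from separate regressions for D = 0 and D = 1, and then gets the same number, plus a moderating effect, from one pooled regression with D interacted with the demeaned covariate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-0.5 * x))            # propensity score depends on x
d = (rng.uniform(size=n) < p).astype(float)   # treatment, unconfounded given x
y0 = 1.0 + 0.5 * x + rng.normal(size=n)
y1 = y0 + 2.0 + 1.0 * x                       # effect varies with x
y = np.where(d == 1, y1, y0)

def fit_line(xs, ys):
    """Intercept and slope from OLS of ys on (1, xs)."""
    A = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0]

# Step 1: separate conditional mean functions for controls and treated.
a0, b0 = fit_line(x[d == 0], y[d == 0])
a1, b1 = fit_line(x[d == 1], y[d == 1])

# Step 2: average the difference in fitted means over the full sample.
ate_ra = np.mean((a1 + b1 * x) - (a0 + b0 * x))

# Equivalent pooled regression: y on 1, d, x, d*(x - xbar).
x_dm = x - x.mean()
A = np.column_stack([np.ones(n), d, x, d * x_dm])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
ate_pooled = coef[1]      # coefficient on d is the RA estimate of the ATE
moderating = coef[3]      # how the effect changes with x
```

The coefficient on the interaction is just the difference in slopes, b1 - b0, which is what makes moderating effects easy to read off with RA.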
It's been too long since I've made a substantive tweet, so here goes. At the following Dropbox link you can access the slides and Stata files for my recent talk at the Stata UK meeting:

It's taken me a while to see connections among various… Perhaps even longer to figure out some tricks to make standard error calculation for aggregated, weighted effects easy. I think I've figured out several useful relationships and shortcuts. Ex post, most are not surprising. I didn't have them all in my WP or my nonlinear DiD.
Okay, here goes. T = 2 balanced panel data. D defines treated group, f2_t is the second period dummy, W_t = D*f2_t is the treatment. Y_1 and Y_2 are outcomes in the first and second period. ΔY = Y_2 - Y_1. X are time-constant controls. X_dm = X - Xbar_1 (mean of treated units). Eight equivalent methods:

1. OLS ΔY on 1, D, X, D*X_dm (cross sec)

2. Pooled OLS of Y_t on 1, W_t, W_t*X_dm, D, X, D*X, f2_t, f2_t*X; ATT is coef on W_t (t = 1,2)

3. Random effects estimation with same variables in (2).

4. FE estimation of (2), where D, X, D*X drop out.
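The equivalence of methods 1, 2, and 4 can be checked numerically. The sketch below uses simulated data with a single control X; the data-generating process and all variable names are assumptions for illustration only. The three ATT estimates should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
d = (0.5 * x + rng.normal(size=n) > 0).astype(float)   # treated-group indicator
c = x + rng.normal(size=n)                             # unit heterogeneity
y1 = 1.0 + c + 0.5 * x + rng.normal(size=n)            # period 1 (no one treated)
y2 = 2.0 + c + 0.8 * x + d * (1.0 + 0.5 * x) + rng.normal(size=n)

x_dm = x - x[d == 1].mean()      # X demeaned about the treated-unit mean
ones = np.ones(n)

def ols(A, y):
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Method 1: cross-sectional OLS of dY on 1, D, X, D*X_dm; ATT is coef on D.
att_cs = ols(np.column_stack([ones, d, x, d * x_dm]), y2 - y1)[1]

# Stack the two periods: first all units at t = 1, then all units at t = 2.
y = np.concatenate([y1, y2])
f2 = np.concatenate([np.zeros(n), ones])
D, X, XDM = np.tile(d, 2), np.tile(x, 2), np.tile(x_dm, 2)
W = D * f2

# Method 2: pooled OLS on 1, W, W*X_dm, D, X, D*X, f2, f2*X; ATT is coef on W.
A_pols = np.column_stack([np.ones(2 * n), W, W * XDM, D, X, D * X, f2, f2 * X])
att_pols = ols(A_pols, y)[1]

# Method 4: FE (within) estimation; D, X, D*X drop out.
def demean_by_unit(v):
    v2 = v.reshape(2, n)                      # rows are periods, columns are units
    return (v2 - v2.mean(axis=0)).ravel()

A_fe = np.column_stack([demean_by_unit(v) for v in (W, W * XDM, f2, f2 * X)])
att_fe = ols(A_fe, demean_by_unit(y))[0]
```

Random effects (method 3) is left out here only because it needs a GLS step.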
I've been asked recently by a few people about using a control function approach along with the Poisson FE estimator with panel data. It turns out there's a simple solution if you're willing to assume a linear first stage.

Use linear FE in the first stage and obtain residuals. Of course, you'd include time dummies.

In the second stage, insert the residuals into an exponential function that includes all variables -- endogenous and exogenous. This is the CF step. Estimate using the Poisson FE estimator.

Time dummies in second stage, too.
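A rough sketch of the two steps on simulated data. Everything specific here is an assumption for illustration: the data-generating process, the true values `beta` and `rho`, and the hand-rolled `poisson_fe` function, which maximizes the Poisson FE (conditional multinomial) log likelihood by plain Newton iterations with no safeguards. It is not a substitute for a packaged estimator, and it produces no standard errors, which would need to account for the first-stage estimation (for example, by a panel bootstrap).

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 400, 5
beta, rho = 0.5, 0.5     # illustrative true values (assumptions)

def unit_demean(a):
    """Subtract each unit's time average; a has shape (N, T)."""
    return a - a.mean(axis=1, keepdims=True)

# Time dummies for periods 2..T, each of shape (N, T).
td = [np.tile((np.arange(T) == s).astype(float), (N, 1)) for s in range(1, T)]

c = rng.normal(size=(N, 1))                  # unit heterogeneity
z = 0.5 * c + rng.normal(size=(N, T))        # instrument
v = rng.normal(size=(N, T))                  # first-stage error
w = 0.8 * z + 0.5 * c + v                    # endogenous explanatory variable
y = rng.poisson(np.exp(beta * w + rho * v + 0.3 * c))   # endogeneity enters through v

# Stage 1: linear FE of w on z and time dummies; keep the residuals.
Z1 = np.column_stack([unit_demean(a).ravel() for a in [z] + td])
wd = unit_demean(w).ravel()
pi_hat = np.linalg.lstsq(Z1, wd, rcond=None)[0]
vhat = (wd - Z1 @ pi_hat).reshape(N, T)

def poisson_fe(y, regressors, iters=50, tol=1e-10):
    """Poisson FE estimates via Newton's method on the conditional log likelihood."""
    X = np.stack(regressors, axis=2)                  # shape (N, T, K)
    y = y.astype(float)
    keep = y.sum(axis=1) > 0                          # all-zero units carry no information
    X, y = X[keep], y[keep]
    n_i = y.sum(axis=1, keepdims=True)
    b = np.zeros(X.shape[2])
    for _ in range(iters):
        xb = X @ b
        xb = xb - xb.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(xb)
        p = p / p.sum(axis=1, keepdims=True)          # within-unit shares
        px = p[:, :, None] * X
        xbar = px.sum(axis=1)                         # share-weighted mean of x
        score = ((y - n_i * p)[:, :, None] * X).sum(axis=(0, 1))
        neg_hess = (np.einsum('i,itk,itl->kl', n_i[:, 0], px, X)
                    - np.einsum('i,ik,il->kl', n_i[:, 0], xbar, xbar))
        step = np.linalg.solve(neg_hess, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Stage 2 (CF step): add the residuals to the exponential mean, with time dummies.
b_cf = poisson_fe(y, [w, vhat] + td)
b_naive = poisson_fe(y, [w] + td)    # ignores endogeneity, for comparison
```

The first element of `b_cf` is the coefficient on the endogenous variable and the second is the coefficient on the first-stage residuals; a nonzero value for the latter is evidence of endogeneity with respect to the idiosyncratic shocks.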
Thanks for doing this, Jon. I've been thinking about this quite a bit, and teaching my perspective. I should spend less time teaching, more time revising a certain paper. Here's my take, which I think overlaps a lot with yours. I never thought of BJS as trying to do a typical event study. As I showed in my TWFE-TWMundlak paper, without covariates, BJS is the same as what I called extended TWFE. ETWFE puts in only treatment dummies of the form Dg*fs, s >= g, where Dg is cohort, fs is calendar time.
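To make the Dg*fs, s >= g structure concrete, here is a small sketch that builds those treatment dummies for a toy balanced panel. The function name `etwfe_treatment_dummies`, the coding of never-treated units as cohort 0, and the toy data are all assumptions for illustration.

```python
import numpy as np

def etwfe_treatment_dummies(cohort, period, T):
    """Build the Dg*fs indicators for s >= g.

    cohort: first treated period for each observation (0 = never treated).
    period: calendar time for each observation, coded 1..T.
    """
    cols, names = [], []
    for g in sorted(set(cohort.tolist()) - {0}):
        for s in range(g, T + 1):
            cols.append(((cohort == g) & (period == s)).astype(float))
            names.append(f"D{g}_f{s}")
    return np.column_stack(cols), names

T = 4
unit_cohort = np.array([0, 2, 3, 4])      # one never-treated unit, three cohorts
cohort = np.repeat(unit_cohort, T)        # long format: unit-major ordering
period = np.tile(np.arange(1, T + 1), len(unit_cohort))

W, names = etwfe_treatment_dummies(cohort, period, T)
```

With T = 4 and cohorts 2, 3, 4 there are 3 + 2 + 1 = 6 indicators, and they sum across columns to the usual treatment dummy. No pre-treatment (s < g) terms are included.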
I sometimes get asked, in the context of interventions using DiD methods, whether an "always treated" (AT) group can be, or should be, included. Typically, there are also many units not treated until t = 2 or later. But some are treated at entry and remain treated. The short answer is that these units don't help identify true treatment effects except under strong assumptions. Suppose we have only an AT and never treated (NT) group. Units have a string of zeros or string of ones for the treatment indicator.
Here's a simple result from probability that I'm not sure is widely known. It has important practical implications, particularly for incorporating heterogeneity into models.

Suppose one starts with a "structural" conditional expectation, E(Y|X,U) = g(X,U), where U is unobserved. Usually g(.,.) is parameterized, but, unless the model is additive in U, the parameters may not mean much. We tend these days to focus on average partial effects. So, for example, E[dg(X,U)/dx] when X is continuous. The expectation is over (X,U).
How come Stata doesn't report an R-squared with the "newey" command? In my opinion, the correct answer is (c): no good reason. Supposed "problems" with the R-squared with heterosk or ser correlation seem to be holdovers from old textbooks. There's no unbiased estimator of the pop R^2, so discussing bias really is off base.
Unfortunately, indiscriminate use of the term "fixed effects" to describe any set of mutually exclusive and exhaustive dummy variables seems to be generating confusion about nonlinear models and the incidental parameters problem.

#metricstotheface With panel data, the IPP arises when we try to include unit-specific dummies in a nonlinear model with a small number of time periods. We have few observations per "fixed effect." In other cases, IPP arises if we put in group-specific dummies with small group sizes.
If Y, D (treatment), and Z (IV) are all binary with controls X, to obtain LATE you can use a linear model and estimate by IV:
Y = a + b*D + X*c + Z*(X - Xbar)*d + U
First stage:
D = f + g*Z + X*h + Z*(X - Xbar)*m + V

Or look at this recent WP by @TymonSloczynski, @sderyauysal, and me to use separate doubly robust estimates of the numerator and denominator. Can use logit outcome models for Y and D.…
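A minimal sketch of the linear IV version on simulated binary data, with one control X. The data-generating process and variable names are assumptions for illustration. Z is the only excluded instrument, so the model is just identified, and the one-shot IV formula and the explicit two-step 2SLS give the same coefficient on D.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
z = (rng.uniform(size=n) < 0.5).astype(float)                        # binary instrument
d = (0.3 * x + 1.0 * z + rng.normal(size=n) > 0.5).astype(float)     # binary treatment
y = (0.5 * d + 0.3 * x + rng.normal(size=n) > 0).astype(float)       # binary outcome

zx = z * (x - x.mean())          # Z*(X - Xbar), appears in both equations
ones = np.ones(n)

# Structural regressors and instrument list (Z replaces D).
R = np.column_stack([ones, d, x, zx])
Q = np.column_stack([ones, z, x, zx])

# Just-identified IV estimator: (Q'R)^{-1} Q'y; coefficient on D is element 1.
b_iv = np.linalg.solve(Q.T @ R, Q.T @ y)[1]

# Same thing in two steps: first stage for D, then plug in fitted values.
d_hat = Q @ np.linalg.lstsq(Q, d, rcond=None)[0]
R_hat = np.column_stack([ones, d_hat, x, zx])
b_2sls = np.linalg.lstsq(R_hat, y, rcond=None)[0][1]
```

The two-step version is shown only to make the first stage explicit; its second-stage OLS standard errors would not be valid, so in practice one would use a proper IV routine.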
Much focus on Poisson regression (whether for cross section or FE Poisson for panel data) is on its consistency when the conditional mean (almost always assumed to be exponential) is correctly specified. This is its most important feature. A less well known but very important feature is its relative efficiency in the class of robust estimators -- that is, estimators consistent when only the mean is correct. (This requirement rules out MLEs of lots of models, such as NegBin I and NegBin II.)
I've said this often to my students, both at MSU and in short courses:

There are good reasons and bad reasons not to use an estimator. You'll be more convincing as an empirical researcher if you know the difference.

Maybe this suggests a good way to write an exam ....

Good reason not to use standard random effects: It assumes heterogeneity is uncorrelated with X.

Bad reason not to use RE (linear model): It requires homoskedasticity and no serial correlation of idiosyncratic errors. (False)
To people who badger empirical researchers using micro-type panel data -- where N is pretty large and T is not -- into computing tests for cross-sectional dependence in the errors: Please stop!

These tests give lots of false positives due to unobserved heterogeneity. This is essentially like testing for cluster correlation using residuals after OLS. Even under random sampling and random assignment -- where we know clustering is not needed -- tests for cluster correlation will often reject if there is neglected heterogeneity.
I've been so discombobulated lately that I don't keep track of what's in each version of my papers and what I include in lectures/teaching. So here's an update on what I've learned about DiD in 2022.

#jwdid (borrowing from @friosavila). 1. The pooled OLS method I proposed, which is the same as TWFE and random effects, is also equivalent to a version of imputation I proposed. That means it is consistent for various ATTs under weak assumptions (but those include no anticipation and parallel trends).
A DiD update. I've been editing my nonlinear DiD paper and I have posted a working paper here:…

It's actually more up to date than the latest version of the linear paper. I've been trying to clean up the Stata do files for both the linear and nonlinear cases. I've learned a lot since last updating -- tricks that make things simpler (in linear and nonlinear cases). I'll pin a new tweet with the Dropbox location.
A problem with specification testing is that it can lead those who are inexperienced to think that empirical work is mostly about applying a slew of specification tests to a particular model and then trying to sort out the findings. This is apparent with linear panel data models, where one sees the Breusch-Pagan test used to choose between POLS and RE; the F test of the unit-specific dummies to choose between POLS and FE; and the Hausman test to choose between RE and FE.
Not sure about that! But here's a first attempt. Suppose I have a control group and G treatment levels. The treatment, W, takes values in {0,1,2,...,G} and is unconfounded conditional on X. Assume the overlap condition 0 < p0(x) = P(W=0|X=x) for all x in Support(X). This isn't a trivial assumption b/c it requires that for any subset of the population as determined by values of x, there are some control units. However, if this isn't true, one can trim the sample -- as in the Crump et al. "Moving the Goalposts" work.
If in a staggered DiD setting I write an equation with a full set of treatment indicators by treated cohort and calendar time, and include c(i) + f(t) (unit and time "fixed effects"), would you still call that a "fixed effects" model? If you answer "yes" then you should stop saying things like "there's a problem with the TWFE 'model'." The modeling is our choice; we choose what to put in x(i,t) when we write

y(i,t) = x(i,t)*b + c(i) + f(t) + u(i,t)

The phrase "TWFE model" refers to c(i) + f(t), right?
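To underline that c(i) + f(t) is just a way of handling heterogeneity, whatever goes into x(i,t), here is a small sketch on a simulated balanced panel with a scalar x(i,t). The data-generating process and names are assumptions for illustration. It shows that the two-way within transformation and the regression with explicit unit and time dummies give the same estimate of b.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 50, 5
c = rng.normal(size=(N, 1))                    # unit effects c(i)
f = rng.normal(size=(1, T))                    # time effects f(t)
x = 0.5 * c + 0.5 * f + rng.normal(size=(N, T))   # x correlated with both
y = 2.0 * x + c + f + rng.normal(size=(N, T))

def two_way_demean(a):
    """Remove unit means and time means, add back the grand mean (balanced panel)."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

# TWFE via the two-way within transformation.
xd, yd = two_way_demean(x), two_way_demean(y)
b_within = (xd * yd).sum() / (xd ** 2).sum()

# TWFE via explicit dummies. Observations are ordered unit by unit (row-major).
unit_d = np.kron(np.eye(N), np.ones((T, 1)))          # N*T x N unit dummies
time_d = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]   # time dummies, first period dropped
A = np.column_stack([x.ravel(), unit_d, time_d])
b_dummy = np.linalg.lstsq(A, y.ravel(), rcond=None)[0][0]
```

The simple double-demeaning formula relies on the panel being balanced; with unbalanced panels the dummy-variable regression is the safer route.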