Twitter threads about #metricstotheface

Unfortunately, indiscriminate use of the term "fixed effects" to describe any set of mutually exclusive and exhaustive dummy variables seems to be generating confusion about nonlinear models and the incidental parameters problem.

#metricstotheface
With panel data, the IPP arises when we try to include unit-specific dummies in a nonlinear model with a small number of time periods: we have few observations per "fixed effect." In other cases, the IPP arises if we put in group-specific dummies with small group sizes.
But if we include, say, occupation dummies when we have lots of people in each occupation, this clearly causes no problem. Or, including interviewer "fixed effects" when we have lots of subjects per interviewer.
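One quick way to see the mechanism is the classic Neyman-Scott variance example (a normal model rather than a logit, and purely invented data): with one dummy per unit and only T = 2 observations per dummy, the MLE of the error variance stays badly off no matter how many units we add.

```python
# Minimal sketch of the incidental parameters problem, Neyman-Scott style.
import numpy as np

rng = np.random.default_rng(0)
N, T, sigma2 = 5000, 2, 1.0
c = rng.normal(size=N)                                   # one "fixed effect" per unit
y = c[:, None] + np.sqrt(sigma2) * rng.normal(size=(N, T))

# Estimating the N unit dummies is the same as demeaning within unit;
# the MLE of sigma^2 then divides by N*T rather than N*(T-1).
resid = y - y.mean(axis=1, keepdims=True)
sigma2_mle = (resid ** 2).sum() / (N * T)
print(sigma2_mle)   # converges to sigma2*(T-1)/T = 0.5, not 1.0, as N grows
```
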
Not sure about that! But here's a first attempt. Suppose I have a control group and G treatment levels. The treatment, W, takes values in {0,1,2,...,G} and is unconfounded conditional on X. Assume the overlap condition 0 < p0(x) = P(W=0|X=x) for all x in Support(X).
This isn't a trivial assumption b/c it requires that, for any subset of the population as determined by values of x, there are some control units. However, if this isn't true, one can trim the sample -- as in the Crump et al. "Moving the Goalposts" work.
If overlap holds and conditional means are linear, the following regression recovers the ATTs of each group g relative to control:

Y on 1, W1, W2, ..., WG, X, W1*(X - Xbar1), W2*(X - Xbar2), ..., WG*(X - XbarG), where Xbarg is the sample average of X in treatment group g.
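A minimal sketch of that regression with G = 2 treatment levels and a scalar X, on invented data (variable names and the data-generating process are mine):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, G = 2000, 2
X = rng.normal(size=n)
W = rng.integers(0, G + 1, size=n)               # 0 = control, 1..G = treatment levels
Y = 1 + 0.5 * X + 1.0 * (W == 1) + 2.0 * (W == 2) + rng.normal(size=n)

cols, names = [np.ones(n), X], ["const", "X"]
for g in range(1, G + 1):
    Wg = (W == g).astype(float)
    Xbar_g = X[W == g].mean()                    # sample average of X in treatment group g
    cols += [Wg, Wg * (X - Xbar_g)]
    names += [f"W{g}", f"W{g}*(X-Xbar{g})"]

res = sm.OLS(Y, np.column_stack(cols)).fit(cov_type="HC1")
print(dict(zip(names, np.round(res.params, 3))))  # coefficients on W1, W2 estimate the ATTs
```

The standard errors here treat the group means Xbarg as known; that's a simplification for the sketch.
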
If in a staggered DiD setting I write an equation with a full set of treatment indicators by treated cohort and calendar time, and include c(i) + f(t) (unit and time "fixed effects"), would you still call that a "fixed effects" model?
If you answer "yes" then you should stop saying things like "there's a problem with the TWFE 'model'." The modeling is our choice; we choose what to put in x(i,t) when we write

y(i,t) = x(i,t)*b + c(i) + f(t) + u(i,t)

The phrase "TWFE model" refers to c(i) + f(t), right?
If x(i,t) = w(i,t) -- a single treatment indicator -- then the model might be too restrictive. But as I've shown in my DiD work, it's easy to put more in x(i,t) and estimate a full set of heterogeneous TEs. But I can (and should) still use the TWFE estimator.
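As a concrete (made-up) illustration of putting more into x(i,t): below, the single w(i,t) is replaced by a full set of cohort-by-calendar-time indicators, and the estimates still come from the TWFE (dummy-variable) regression.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
N, T = 200, 6
df = pd.DataFrame({"i": np.repeat(np.arange(N), T),
                   "t": np.tile(np.arange(1, T + 1), N)})
df["cohort"] = np.repeat(rng.choice([0, 4, 5], size=N), T)      # 0 = never treated
df["y"] = rng.normal(size=len(df)) + 0.5 * ((df.cohort > 0) & (df.t >= df.cohort))

# One treatment indicator per (treated cohort g, calendar time s >= g).
for g in (4, 5):
    for s in range(g, T + 1):
        df[f"w_{g}_{s}"] = ((df.cohort == g) & (df.t == s)).astype(float)

treat = " + ".join(c for c in df.columns if c.startswith("w_"))
twfe = smf.ols(f"y ~ {treat} + C(i) + C(t)", data=df).fit()
print(twfe.params.filter(like="w_"))    # one estimated cohort/time effect per cell
```
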
Not exactly. I like Bruce's approach in this paper and it yields nice insights. But from Twitter and private exchanges last week, and what I've learned since, it seems that the class of estimators in play in Theorem 5 includes only estimators that are linear in Y.

#metricstotheface
Theorem 5 is correct and neat, but leaves open the question of which estimators are in the class that is being compared with OLS. Remember, we cannot simply use phrases such as "OLS is BUE" without clearly defining the competing class of estimators. This is critical.
The class of distributions in F2 is so large -- only restricting the mean to be linear in X and assuming finite second moments -- that it's not surprising the class of unbiased estimators is "small." So small that it contains only estimators linear in Y.
Concerning the recent exchange many of us had about @BruceEHansen's new Gauss-Markov Theorem, I now understand a lot more and can correct/clarify several things I wrote yesterday. I had a helpful email exchange with Bruce that confirmed my thinking.

#metricstotheface
A lot was written about the "linear plus quadratic" class of estimators as possible competitors to OLS. Here's something important to know: Bruce's result does not allow these estimators in the comparison group with OLS unless they are actually linear; no quadratic terms allowed.
If one looks at Theorem 5 concerning OLS, you'll see a distinction between F2 and F2^0. All estimators in the comparison group must be unbiased under the very large class of distributions, F2. This includes all distributions with finite second moments -- so unrestricted SIGMA.
One of the remarkable features of Bruce's result, and why I never could have discovered it, is that the "asymptotic" analog doesn't seem to hold. Suppose we assume random sampling and in the population specify

A1. E(y|x) = x*b0
A2. Var(y|x) = (s0)^2

#metricstotheface
Also assume rank E(x'x) = k so no perfect collinearity in the population. Then OLS is asymptotically efficient among estimators that only use A1 for consistency. But OLS is not asymp effic among estimators that use A1 and A2 for consistency.
A2 adds many extra moment conditions that, generally, are useful for estimating b0 -- for example, if D(y|x) is asymmetric with third central moment depending on x. So there are GMM estimators more asymp efficient than OLS under A1 and A2.
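To make the extra information concrete, here is one way to write the moment conditions (my notation, not from the thread): A1 delivers the orthogonality conditions OLS uses, while A2 supplies an additional set that efficient GMM can exploit.

```latex
% g_1 and g_2 are arbitrary (conformable) functions of x.
\begin{align*}
\text{A1:}\quad & \operatorname{E}\!\left[ g_1(x)'\,(y - x\beta_0) \right] = 0,
  \qquad \text{OLS uses } g_1(x) = x, \\
\text{A2:}\quad & \operatorname{E}\!\left[ g_2(x)'\,\bigl\{ (y - x\beta_0)^2 - \sigma_0^2 \bigr\} \right] = 0 .
\end{align*}
```
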
Here's an example I use in the summer ESTIMATE course at MSU. It's based on an actual contingent valuation survey. There are two prices, one for regular apples and the other for "ecologically friendly" apples. The prices were randomly assigned as a pair, (PR, PE).

#metricstotheface
Individuals were then asked to choose a basket of regular and eco-friendly apples. A linear regression for QE (quantity of eco-labeled apples) gives very good results: a strong downward-sloping demand curve, and an increase in the competing price shifts out the demand curve.
Now, the prices were generated to be highly correlated, with corr = 0.83. Not VIF > 10 territory, but a pretty high correlation. If PR is dropped from the equation for QE, the estimated price effect for PE falls dramatically -- because there's an important omitted variable, PR.
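A quick simulation sketch of that omitted-variable logic, with invented numbers rather than the actual survey data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
PR = 1.0 + z + 0.3 * rng.normal(size=n)        # prices built to be highly correlated
PE = 1.2 + z + 0.3 * rng.normal(size=n)
print("corr(PR, PE) =", np.corrcoef(PR, PE)[0, 1])

QE = 10 - 2.0 * PE + 1.0 * PR + rng.normal(size=n)   # demand for eco-labeled apples

both = sm.OLS(QE, sm.add_constant(np.column_stack([PE, PR]))).fit()
own_only = sm.OLS(QE, sm.add_constant(PE)).fit()
print("PE coefficient with PR included:", both.params[1])      # near the true -2
print("PE coefficient with PR omitted: ", own_only.params[1])  # badly attenuated
```
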
If you know people who teach students it's important to "test" for multicollinearity, please ask them why.

I imagine a world where the phrase "I tested for multicollinearity" no longer appears in published work. I know John Lennon would be on my side.

#metricstotheface
What I'm getting at is that it's still common to see "tests" for multicollinearity without even looking at the regression output. Or asking which variables are collinear. Often it's control variables. So what? If you have many control variables you might have to select.
And a VIF of 9.99 is okay but 10.01 is a disaster? We can do better than this across all fields.

I just saw a post where X1 and X2 have a correlation of .7, and the researcher wonders which variable to drop.
A Twitter primer on the canonical link and the linear exponential family. I've used this combination in a few of my papers: the doubly robust estimators for estimating average treatment effects, improving efficiency in RCTs, and, most recently, nonlinear DiD.

#metricstotheface
The useful CL/LEF combinations are:
1. linear mean/normal
2. logistic mean/Bernoulli (binary or fractional)
3. logistic mean/binomial (0 <= Y <= M)
4. exponential mean/Poisson (Y >= 0)
5. logistic means/multinomial

The last isn't used very much -- yet.
The key statistical feature of the CL/LEF combinations is that the first order conditions look like those for OLS (combination 1). The residuals add to zero and each covariate is uncorrelated with the residuals in sample. The residuals are uhat(i) = y(i) - mhat(x(i)).
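A quick check of that property for combination 4 (exponential mean/Poisson QMLE), on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.2 + 0.5 * x))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # canonical log link by default
uhat = y - fit.predict(X)                                 # uhat(i) = y(i) - mhat(x(i))

print("sum of residuals:", uhat.sum())         # ~0, just as with OLS
print("sum of x*uhat:   ", (x * uhat).sum())   # ~0: covariate uncorrelated with residuals
```
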
Because of a recent post at Data Colada, I've been asked about my take on the various heterosk-robust standard errors. In the taxonomy of MacKinnon-White and Davidson-MacKinnon, there's HC0, HC1, HC2, HC3.

#metricstotheface

datacolada.org/99
HC0 was the original variance matrix estimator proposed in White (1980, Econometrica). HC1 = [n/(n-k)]*HC0 makes a simple df adjustment. Clearly, HC1 - HC0 is positive semi-definite (even PD).
HC2 divides the squared resids, uhat(i)^2, by 1 - h(i,i), where the h(i,i) are the diagonal elements of the "hat" or projection matrix. It can be shown that this produces n different unbiased estimators of sigma^2 under homoskedasticity.
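Here is a sketch of those formulas on simulated data, checked against the statsmodels built-ins (the data-generating process is invented):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 200, 3
X = sm.add_constant(rng.normal(size=(n, k - 1)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)              # leverages h(i,i)

HC0 = XtX_inv @ (X * u[:, None] ** 2).T @ X @ XtX_inv    # White (1980)
HC1 = n / (n - k) * HC0                                  # simple df adjustment
HC2 = XtX_inv @ (X * (u ** 2 / (1 - h))[:, None]).T @ X @ XtX_inv

for name, V in [("HC0", HC0), ("HC1", HC1), ("HC2", HC2)]:
    print(name, np.sqrt(np.diag(V)), sm.OLS(y, X).fit(cov_type=name).bse)  # should match
```
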
On my shared Dropbox folder, pinned at the top, I posted the latest version of my TWFE/TWMundlak paper. It's essentially complete (and too long ...). I've included the "truly marvelous" proof of equivalence between pooled OLS and imputation.

#metricstotheface
I also fixed some of the material on testing/correcting for heterogeneous trends. A nice result is that the POLS approach with cohort-specific trends is the same as the obvious imputation approach.
This means that using the full regression to correct for non-parallel trends suffers no contamination when testing. It's identical to using only untreated obs to test for pre-trends. But one must allow full heterogeneity in cohort/time ATTs for the equivalence to hold.
Fortunately, the speculations I made in my linear DiD paper about extension to the nonlinear case turn out to be true -- with a small caveat. One should use the canonical link function for the chosen quasi-log-likelihood (QLL) function.

#metricstotheface
So, exponential mean/Poisson QLL if y >= 0.
Logistic mean/Bernoulli QLL if 0 <= y <= 1 (binary or fractional). (We call this logit and fractional logit.)
Linear mean, normal (OLS, of course).

These choices ensure that pooled estimation and imputation are numerically identical.
It's not a coincidence that these same combos show up in my work on doubly robust estimation of treatment effects and improving efficiency without sacrificing consistency in RCTs. Latest on the latter is here:

scholar.google.com/citations?view…
I finally got my TWFE/Mundlak/DID paper in good enough shape to make it an official working paper. I'll put it in other places but it's currently here:

researchgate.net/publication/35…

Also, the Stata stuff is still with the Dropbox link:

dropbox.com/sh/zj91darudf2…

#metricstotheface
I changed the title a bit to better reflect its contents. I'm really happy with the results, less happy that the paper got a bit unwieldy. It's intended to be a "low hanging fruit" DID paper.
Now I've more formally shown that the estimator I was proposing -- either pooled OLS or TWFE or RE (they're all the same, properly done) -- identifies every dynamic treatment effect one is interested in (on means) in a staggered design.
Here's a panel DID question. Common intervention at t=T0. Multiple pre-treatment and post-treatment periods. Dummy d(i) is one if a unit is eventually treated. p(t) is one for t >= T0. Treatment indicator is w(i,t) = d(i)*p(t). Time constant controls are x(i).

#metricstotheface
Consider several estimators of the avg TE [coef on w(i,t)]. Period dummies are f2(t), ... fT(t).

1. Pooled OLS: y(i,t) on w(i,t), 1, d(i), p(t)
2. TWFE including w(i,t).
3. POLS: y(i,t) on w(i,t), 1, d(i), p(t), x(i)
4. POLS: y(i,t) on w(i,t), 1, d(i), f2(t), ..., fT(t), x(i)
For a balanced panel without degeneracies, which is the correct statement?
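Just to make the four specifications concrete, here is how they might be run on a simulated balanced panel (the data-generating process is invented; this is not meant to answer the poll):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
N, T, T0 = 300, 6, 4
df = pd.DataFrame({"i": np.repeat(np.arange(N), T),
                   "t": np.tile(np.arange(1, T + 1), N)})
df["d"] = np.repeat(rng.integers(0, 2, size=N), T)      # eventually-treated indicator
df["x"] = np.repeat(rng.normal(size=N), T)              # time-constant control
df["p"] = (df.t >= T0).astype(int)
df["w"] = df.d * df.p
df["y"] = df.x + 0.5 * df.d + 0.3 * df.p + 1.0 * df.w + rng.normal(size=len(df))

m1 = smf.ols("y ~ w + d + p", data=df).fit()            # 1. POLS with d and p
m2 = smf.ols("y ~ w + C(i) + C(t)", data=df).fit()      # 2. TWFE
m3 = smf.ols("y ~ w + d + p + x", data=df).fit()        # 3. POLS adding x
m4 = smf.ols("y ~ w + d + C(t) + x", data=df).fit()     # 4. POLS with period dummies and x
print([round(m.params["w"], 3) for m in (m1, m2, m3, m4)])
```
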
I should admit that my tweets and poll about missing data were partly self-serving, as I'm interested in what people do. But it was a mistake to leave the poll initially vague. I haven't said much useful on Twitter in some time, so I'll try here.

#metricstotheface
I want to start with the very simple case where there is one x and I'm interested in E(y|x); assume it's linear (for now). Data are missing on x but not on y. Here are some observations.
1. If the data are missing as a function of x -- formally, E(y|x,m) = E(y|x) -- the CC estimator is consistent (even conditionally unbiased).
2. Imputing on the basis of y is not consistent and can be badly biased.
3. Inverse probability weighting using 1/P(m=0|y) also is inconsistent.
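A small simulation sketch of point 1 (invented data): when missingness in x depends only on x, the complete-cases (CC) slope estimator for E(y|x) stays on target.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)                      # E(y|x) = 1 + 2x

miss = rng.uniform(size=n) < 1 / (1 + np.exp(-x))       # P(x missing) depends only on x
keep = ~miss

cc = sm.OLS(y[keep], sm.add_constant(x[keep])).fit()
print("CC slope estimate:", cc.params[1])               # close to 2 despite ~50% missing
```
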
Several comments on this paper. First, it's nice to see someone taking the units of measurement issue seriously. But I still see many issues, especially when y >= 0 and we have better alternatives.

1. A search is required over units of measurement.

#metricstotheface
How do I compute a legitimate standard error of, say, an elasticity? I've estimated theta but then I ignore the fact that I estimated it? That's not allowed.

2. As with many transformation models, the premise is there exists a transformation g(.) such that g(y) = xb + u.
u is assumed to be indep of x, at a minimum. Often the distrib is restricted. In 1989 in an IER paper I argued this was a real problem with Box-Cox approaches b/c u >= -xb. If I model E(y|x) directly I need none of that. It's what Poisson regression does.
If I'm interested in a treatment effect on a variable y, I want E(y1|x) - E(y0|x) or E(y1|x)/E(y0|x). It's pretty clear I generally cannot obtain it from a model for g(y) for some transformation g(.) -- whether it's log(y), log(1+y), arcsinh(y), log(y/(1+y)).

#metricstotheface
Simulations that show maybe g(.) can be undone without too much harm are necessarily special; they always make strong assumptions about independence and distribution: often, errors are independent of x and normally distributed. I can find cases where using g(.) is disastrous.
When we have simple theorems that provide better alternatives to transformations we should use them. For y >= 0 it's hard to beat Poisson regression. For 0 <= y <=1 it's hard to beat frac logit. If simulations were sufficient then econometrics would almost cease to be a field.
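A minimal sketch of the two recommended alternatives on simulated data (parameter values invented): Poisson regression for y >= 0 and fractional logit for 0 <= y <= 1, both as quasi-MLEs with robust standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = sm.add_constant(x)

y_count = rng.poisson(np.exp(0.1 + 0.4 * x))                    # y >= 0, no transformation
y_frac = np.clip(1 / (1 + np.exp(-(0.2 + 0.8 * x)))
                 + 0.05 * rng.normal(size=n), 0, 1)              # 0 <= y <= 1

poisson_fit = sm.GLM(y_count, X, family=sm.families.Poisson()).fit(cov_type="HC0")
fraclogit_fit = sm.GLM(y_frac, X, family=sm.families.Binomial()).fit(cov_type="HC0")

# Both model E(y|x) directly, so partial effects on the level of y
# come straight from the fitted means -- no transformation to undo.
print(poisson_fit.params, fraclogit_fit.params)
```
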
A year ago on Facebook, at the request of a former MSU student, I made this post. I used to say in class that econometrics is not so hard if you just master about 10 tools and apply them again and again. I decided I should put up or shut up.

#metricstotheface
I cheated by combining tools that are connected, so there are actually more than 10 ....
1. Law of Iterated Expectations, Law of Total Variance
2. Linearity of Expectations, Variance of a Sum
3. Jensen's Inequality, Chebyshev’s Inequality
4. Linear Projection and Its Properties
5. Weak Law of Large Numbers, Central Limit Theorem
6. Slutsky's Theorem, Continuous Convergence Theorem, Asymptotic Equivalence Lemma
7. Big Op, Little op, and the algebra of them.
Have we yet figured out when we should include a lagged dependent variable in either time series or panel data models when the goal is to infer causality? (For forecasting the issue is clear.) Recent work in macroeconomics on causal effects is a positive sign.

#metricstotheface
And the answer cannot be, "Include y(t-1) if it is statistically significant." Being clear about potential outcomes and the nature of the causal effects we hope to estimate is crucial. I need to catch up on this literature and I need to think more.
In panel data settings, if our main goal is to distinguish state dependence from heterogeneity, clearly y(t-1) gets included. But what if our interest is in a policy variable? Should we hold fixed y(t-1) and the heterogeneity when measuring the policy effect?
Tests that should be retired from empirical work:

Durbin-Watson statistic.
Jarque-Bera test for normality.
Breusch-Pagan test for heteroskedasticity.
B-P test for random effects.
Nonrobust Hausman tests.

I feel uncomfortable using names, but this is #metricstotheface.
D-W test only gives bounds. More importantly, it maintains the classical linear model assumptions.
J-B is an asymptotic test. If we can use asymptotics then normality isn't necessary.
B-P test for heteroskedasticity: maintains normality and constant conditional 4th moment.
B-P test for RE: maintains normality and homoskedasticity but, more importantly, detects any kind of positive serial correlation.
Nonrobust Hausman: maintains unnecessary assumptions under the null that conflict with using robust inference. Has no power to test those assumps.
Historically, economics has fallen into the bad habit of thinking the fancier estimation method is closer to being "right" -- based on one sample of data. We used to think this of GLS vs OLS until we paid careful attention to exogeneity assumptions.

#metricstotheface
We used to think this of 2SLS vs OLS until problems with weak instruments were revealed.

We still seem to think matching methods are somehow superior to regression adjustment if they give somewhat different estimates.
And we now seem to think ML methods are to be preferred over basic methods. Perhaps a reckoning is on its way.
I wonder whether machine learning is becoming to economics what hierarchical models are in other fields: as soon as someone says, "I used (fill in your favorite ML method) to select controls (or IVs)," the audience is supposed to nod approvingly and be quiet.

#metricstotheface
And I know there is still plenty of discussion to be had about unconfoundedness, exogeneity and strength of instruments, and so on. But I'm becoming more suspicious about whether ML really delivers in selecting controls and IVs. Only blind competitions will tell us.
Even worse would be if ML becomes a de facto requirement for empirical work in cases where its benefits are questionable -- or even when ML might be harmful.
Yesterday I was feeling a bit guilty about not teaching lasso, etc. to the first-year PhD students. I'm feeling less guilty today. How much trouble does one want to go through to control for squares and interactions for a handful of control variables?

#metricstotheface
And then it gets worse if I want my key variable to interact with controls. You can't select the variables in the interactions using lasso. I just looked at an application in an influential paper and a handful of controls, some continuous, were discretized.
Discretizing eliminates the centering problem I mentioned, but in a crude way. So I throw out information by arbitrarily using five age and income categories so I can use pdslasso? No thanks.
I was reminded of the issue of centering IVs when creating an example using PDS LASSO to estimate a causal effect. The issue of centering controls before including them in a dictionary of nonlinear terms can be important.

#metricstotheface
The example I did included age and income as controls. Initially I included age, age^2, inc^2, age*inc. PDSLASSO works using a kind of Frisch-Waugh partialling out, imposing sparsity on the controls.
But as we know from basic OLS, not centering before creating squares and interactions can make main effects weird -- with the "wrong" sign and insignificant. This means in LASSO they might be dropped.
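A small sketch of that point with made-up age/income data (the lasso step itself isn't shown): main effects built from uncentered variables are far noisier, which is exactly when a sparsity-imposing step might drop them.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(25, 65, size=n)
inc = rng.uniform(20, 200, size=n)
y = 0.05 * age + 0.01 * inc + 0.001 * age * inc + rng.normal(size=n)

def quad_terms(a, b):
    # dictionary of main effects, squares, and the interaction
    return sm.add_constant(np.column_stack([a, b, a ** 2, b ** 2, a * b]))

raw = sm.OLS(y, quad_terms(age, inc)).fit()
cen = sm.OLS(y, quad_terms(age - age.mean(), inc - inc.mean())).fit()
print("uncentered main-effect t-stats:", raw.tvalues[1:3])
print("centered main-effect t-stats:  ", cen.tvalues[1:3])
```
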