Not exactly. I like Bruce's approach in this paper and it yields nice insights. But from Twitter and private exchanges last week, and what I've learned since, it seems that the class of estimators in play in Theorem 5 includes only estimators that are linear in Y.
Theorem 5 is correct and neat, but leaves open the question of which estimators are in the class that is being compared with OLS. Remember, we cannot simply use phrases such as "OLS is BUE" without clearly defining the competing class of estimators. This is critical.
The class of distributions in F2 is so large -- only restricting the mean to be linear in X and assuming finite second moments -- that it's not surprising the class of unbiased estimators is "small." So small, in fact, that it contains only estimators linear in Y.
Stephen Portnoy, @lihua_lei_stat, and I have independently come to this conclusion (I've had exchanges with both and with Bruce). So the statement in Theorem 5 concerning the class of estimators amounts to restricting attention to estimators linear in Y.
If we entertain estimators that are unbiased when the full GM assumptions are used -- so that Var(Y|X) = (s^2)*I -- then OLS is not best unbiased. Interestingly, when we add normality, the Cramer-Rao lower bound shows OLS is best unbiased in a very large class of estimators.
This goes to show that efficiency is not a "monotonic" function as we relax restrictions on the class of estimators.
To show my summary is wrong, you need to find a nonlinear estimator that is unbiased for b for any distribution in F2. Portnoy has shown it's not possible.
@lihua_lei_stat and I have argued the same thing using results mentioned here:
Here's a more succinct way to state my conclusion: 1. OLS is BUE in the class of unbiased estimators E2, defined by the distributions F2 in Hansen. 2. E2 includes only linear estimators.
But this is what we mean when we say OLS is BLUE.
In the original GM Theorem, the class of estimators is explicitly stated to be linear and unbiased. In Hansen Theorem 5, the class of estimators is implicitly linear and unbiased. But the conclusions about OLS are the same.
Here's another observation that may help. We must separate the class of estimators we're willing to entertain from the assumptions under which we claim efficiency. It's easy to confound them. It's desirable to have estimators unbiased under A1 only.
Such estimators are necessarily linear. Then, OLS is best in this class under Assumptions A1 and A2. If we want to consider estimators unbiased under A1, A2, and A3 then there is no ambiguity: OLS is best in this class under A1, A2, and A3.
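To make the comparison concrete, here's a minimal Monte Carlo sketch (simulated data, made-up numbers; just my own illustration of the BLUE comparison, not anything from Bruce's paper): a weighted least squares estimator with arbitrary fixed weights is also linear in Y and unbiased under the mean assumption alone, but under homoskedasticity OLS has the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 2000
b_true = np.array([1.0, 2.0])
x1 = rng.uniform(1.0, 3.0, size=n)             # one fixed design, reused every replication
X = np.column_stack([np.ones(n), x1])
w = 1.0 / x1                                   # arbitrary weights: WLS is still linear in Y
XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))

ols_b1, wls_b1 = [], []
for _ in range(reps):
    y = X @ b_true + rng.normal(size=n)        # homoskedastic errors
    ols_b1.append(np.linalg.solve(X.T @ X, X.T @ y)[1])
    wls_b1.append((XtWX_inv @ (X.T @ (w * y)))[1])

print(np.mean(ols_b1), np.mean(wls_b1))        # both near 2.0: both unbiased
print(np.var(ols_b1), np.var(wls_b1))          # OLS variance is the smaller one
```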
• • •
Concerning the recent exchange many of us had about @BruceEHansen's new Gauss-Markov Theorem, I now understand a lot more and can correct/clarify several things I wrote yesterday. I had a helpful email exchange with Bruce that confirmed my thinking.
A lot was written about the "linear plus quadratic" class of estimators as possible competitors to OLS. Here's something important to know: Bruce's result does not allow these estimators in the comparison group with OLS unless they are actually linear; no quadratic terms allowed.
If you look at Theorem 5 concerning OLS, you'll see a distinction between F2 and F2^0. All estimators in the comparison group must be unbiased under the very large class of distributions, F2. This includes all distributions with finite second moments -- so unrestricted SIGMA.
This is neat and makes sense to me. After all, third moments need not even exist under GM. And using 3rd moments would make it very tough to achieve unbiasedness across all cases with only GM. Clearly, the result says it's impossible.
It still blows my mind that OLS is best unbiased in that class. Across all multivariate distributions with weird 3rd and 4th conditional moments, and beyond. As I said in a previous tweet, this would not be true in an asymptotic setting.
The Koopmann result prompts a question that I've wondered about off and on. If you use the first 3 GM assumptions, which I write as
A1. Y = X*b + U
A2. rank(X) = k
A3. E(U|X) = 0
then, for A n x k, a linear estimator A'Y is unbiased if and only if A'X = I (k x k).
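A quick numeric check of that condition (toy numbers, my own illustration): the OLS choice A = X(X'X)^{-1} satisfies A'X = I, and so does A + C for any n x k matrix C with C'X = 0 -- which traces out exactly the family of linear unbiased estimators being compared.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

A_ols = X @ np.linalg.inv(X.T @ X)                   # n x k, so A'Y is the OLS estimator
print(np.allclose(A_ols.T @ X, np.eye(k)))           # True: A'X = I (k x k)

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T     # annihilator matrix, M @ X = 0
C = M @ rng.normal(size=(n, k))                      # any such C satisfies C'X = 0
print(np.allclose((A_ols + C).T @ X, np.eye(k)))     # True: still linear and unbiased
```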
One of the remarkable features of Bruce's result, and why I never could have discovered it, is that the "asymptotic" analog doesn't seem to hold. Suppose we assume random sampling and in the population specify

A1. E(y|x) = x*b0
A2. Var(y|x) = s^2 > 0
Also assume rank E(x'x) = k so there is no perfect collinearity in the population. Then OLS is asymptotically efficient among estimators that only use A1 for consistency. But OLS is not asymptotically efficient among estimators that use A1 and A2 for consistency.
A2 adds many extra moment conditions that, generally, are useful for estimating b0 -- for example, if D(y|x) is asymmetric with third central moment depending on x. So there are GMM estimators more asymp efficient than OLS under A1 and A2.
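Here's a rough simulation sketch of that claim (the DGP, moment functions, sample sizes, and two-step weighting below are my own illustrative choices, not anything from Bruce's paper or the thread): with homoskedastic but conditionally skewed errors, two-step GMM that adds the moments E[x'((y - x*b)^2 - s^2)] = 0 to the OLS moments tends to estimate the slope more precisely than OLS.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, reps = 1000, 300
b_true = np.array([1.0, 0.5])

def gen_data():
    x1 = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1])
    w = (rng.chisquare(1, size=n) - 1.0) / np.sqrt(2.0)   # mean 0, variance 1, skewed
    u = np.where(x1 > 0, w, -w)        # E(u|x) = 0, Var(u|x) = 1, skewness flips with x
    return X, X @ b_true + u

def moments(theta, X, y):
    b, s2 = theta[:2], theta[2]
    u = y - X @ b
    return np.column_stack([X * u[:, None], X * (u**2 - s2)[:, None]])   # n x 4

def gmm_obj(theta, X, y, W):
    gbar = moments(theta, X, y).mean(axis=0)
    return float(gbar @ W @ gbar)

ols_b1, gmm_b1 = [], []
for _ in range(reps):
    X, y = gen_data()
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    theta0 = np.r_[b_ols, np.mean((y - X @ b_ols) ** 2)]   # first step: OLS + residual variance
    G = moments(theta0, X, y)
    W = np.linalg.inv(G.T @ G / n)                         # estimated optimal weight matrix
    res = minimize(gmm_obj, theta0, args=(X, y, W), method="Nelder-Mead")
    ols_b1.append(b_ols[1])
    gmm_b1.append(res.x[1])

print("MC std dev, OLS slope:", np.std(ols_b1))
print("MC std dev, GMM slope:", np.std(gmm_b1))    # typically noticeably smaller
```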
Here's an example I use in the summer ESTIMATE course at MSU. It's based on an actual contingent valuation survey. There are two prices, one for regular apples, the other for "ecologically friendly" apples. The prices were randomly assigned as a pair, (PR, PE).
Individuals were then asked to choose a basket of regular and eco-friendly apples. A linear regression for QE (quantity of eco-labeled apples) gives very good results: a strong downward-sloping demand curve, and an increase in the competing price shifts out the demand curve.
Now, the prices were generated to be highly correlated, with corr = 0.83. Not VIF > 10 territory, but a pretty high correlation. If PR is dropped from the equation for QE, the estimated price effect for PE falls dramatically -- because there's an important omitted variable, PR.
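The survey data aren't posted here, but the mechanics are easy to mimic with made-up numbers (everything below -- means, variances, demand parameters -- is just an illustration of the omitted variable logic, not the actual data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
cov = 0.25 * np.array([[1.0, 0.83], [0.83, 1.0]])          # corr(PR, PE) = 0.83
PR, PE = rng.multivariate_normal([3.0, 3.0], cov, size=n).T
QE = 5.0 - 1.5 * PE + 1.0 * PR + rng.normal(scale=1.0, size=n)   # downward-sloping own-price demand

full = sm.OLS(QE, sm.add_constant(np.column_stack([PE, PR]))).fit()
short = sm.OLS(QE, sm.add_constant(PE)).fit()
print(full.params)    # own-price effect near -1.5, cross-price effect near +1.0
print(short.params)   # dropping PR shrinks the PE effect toward roughly -0.7
```

The drop is just omitted variable bias: the short regression's PE coefficient absorbs (coefficient on PR) times (slope from regressing PR on PE), which is large when the two prices are highly correlated.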
What I'm getting at is that it's still common to see "tests" for multicollinearity without even looking at the regression output. Or asking which variables are collinear. Often it's control variables. So what? If you have many control variables you might have to select.
And a VIF of 9.99 is okay but 10.01 is a disaster? We can do better than this across all fields.
I just saw a post where X1 and X2 have a correlation of .7, and the researcher wonders which variable to drop.
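For the record, with only two regressors the VIF is 1/(1 - r^2), so a correlation of .7 isn't even close to the usual alarm threshold:

```python
r = 0.7
print(1.0 / (1.0 - r**2))   # about 1.96 -- nowhere near the conventional VIF > 10 cutoff
```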
A Twitter primer on combining the canonical link with the linear exponential family. I've used this combination in a few of my papers: the doubly robust estimators for estimating average treatment effects, improving efficiency in RCTs, and, most recently, nonlinear DiD.
The useful CL/LEF combinations are: 1. linear mean/normal 2. logistic mean/Bernoulli (binary fractional) 3. logistic mean/binomial (0 <= Y <= M) 4. exponential mean/Poisson (Y >= 0) 5. logistic means/multinomial
The last isn't used very much -- yet.
The key statistical feature of the CL/LEF combinations is that the first order conditions look like those for OLS (combination 1). The residuals add to zero and each covariate is uncorrelated with the residuals in sample. Residuals are uhat(i) = y(i) - mhat(x(i)).
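Here's a quick check of that property for combination 4 (exponential mean/Poisson) on simulated data (the DGP below is just for illustration; any Y >= 0 would do for the quasi-MLE):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 2)))
mu = np.exp(X @ np.array([0.2, 0.5, -0.3]))
y = rng.poisson(mu)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # canonical log link
uhat = y - fit.fittedvalues                               # uhat(i) = y(i) - mhat(x(i))
print(X.T @ uhat)    # all entries ~0: residuals sum to zero and are
                     # uncorrelated in sample with each covariate
```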