Noah Haber
Sep 12, 2020 13 tweets 4 min read
STATS QUIZ!

I have the datapoints below. Nothing hidden, no tricks, just a bunch of data making roughly an ellipse.

In your head, draw what you think the ordinary least squares line (i.e. good ol' y = mx + b) looks like for these data.
Is this what you drew?
Seems "obvious" right?

Except that's not the OLS line.

The red dashed line is the OLS line.

What's going on here?
Ordinary least squares is the line which minimizes the VERTICAL squared distances.

Take a vertical line on the data, and draw a midpoint between the top and bottom of the ellipse. Do that for all points.

OLS is the line that minimizes the sum of those (squared) distances.
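
To make the "vertical distances" idea concrete, here is a minimal sketch in Python. The simulated data (a correlated bivariate normal with correlation `rho = 0.6`), and the helper names `simulate_ellipse` and `ols_line`, are illustrative assumptions, not the data or code behind the plots in this thread.

```python
import numpy as np

def simulate_ellipse(n=2000, rho=0.6, seed=0):
    """Draw n points from a correlated bivariate normal (an elliptical cloud).

    Assumed stand-in data; not the dataset plotted in the thread.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return xy[:, 0], xy[:, 1]

def ols_line(x, y):
    """Slope and intercept minimizing the sum of squared VERTICAL distances."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    slope = (x_c @ y_c) / (x_c @ x_c)        # cov(x, y) / var(x)
    intercept = y.mean() - slope * x.mean()  # line passes through the centroid
    return slope, intercept

x, y = simulate_ellipse()
m, b = ols_line(x, y)
print(f"OLS slope: {m:.3f}, intercept: {b:.3f}")
```

With equal variances on both axes, the long axis of the ellipse has slope 1, while the OLS slope should come out close to `rho`, i.e. noticeably flatter than the line most people draw in their heads.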
In your head, you were probably trying to find the line that minimizes the TOTAL distance, not just the distance on the vertical.

That's the ORTHOGONAL (or total) least squares line. Totally natural, and it isn't wrong. Makes you wonder why we care only about the vertical, eh?
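
For comparison, a rough sketch of the orthogonal (total) least squares line, computed from the singular value decomposition of the centered data. The simulated data and the helper name `tls_line` are assumptions for illustration, and the sketch assumes the fitted line is not exactly vertical.

```python
import numpy as np

def tls_line(x, y):
    """Slope and intercept minimizing the sum of squared PERPENDICULAR distances."""
    centered = np.column_stack([x - x.mean(), y - y.mean()])
    # The first right singular vector points along the direction of greatest
    # spread; the orthogonal least squares line runs along it through the centroid.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    slope = dy / dx  # assumes dx != 0, i.e. the line is not vertical
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Assumed stand-in data: an elliptical cloud with correlation 0.6.
rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
x, y = xy[:, 0], xy[:, 1]

ols_slope = np.polyfit(x, y, 1)[0]
tls_slope, _ = tls_line(x, y)
print(f"OLS slope: {ols_slope:.3f}  TLS slope: {tls_slope:.3f}")
```

On data like these the TLS slope should sit near the long axis of the ellipse, and the OLS slope should be visibly flatter.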
RABBIT HOLE TIME!

What we are implying here is that Y is the DOMINANT axis.

That's great and fine if we are trying to predict Y from X, or estimate how much X causes Y.

But you've heard the phrase "it's just association" right?
If all we care about is that X and Y move together, we shouldn't have a dominant axis at all! We'd probably want to use something more like TLS.

In other words, the form of our estimates pushes us toward prediction/causation thinking, even if we really don't want to.
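
One way to see the "dominant axis" in action is to swap the roles of X and Y. The sketch below uses assumed simulated data and hypothetical helper names (`ols_slope`, `tls_slope`). Regressing Y on X and X on Y gives two different lines in the same plot, while the orthogonal line does not care which variable is called which.

```python
import numpy as np

# Assumed stand-in data: an elliptical cloud with correlation 0.6.
rng = np.random.default_rng(1)
xy = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
x, y = xy[:, 0], xy[:, 1]

def ols_slope(u, v):
    """Slope from regressing v on u (squared distances measured along v)."""
    return np.polyfit(u, v, 1)[0]

def tls_slope(u, v):
    """Slope of the orthogonal least squares line of v against u."""
    centered = np.column_stack([u - u.mean(), v - v.mean()])
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0, 1] / vt[0, 0]

# Express every line as dy/dx in the original plot so they are comparable.
y_on_x = ols_slope(x, y)          # Y is the dominant axis
x_on_y = 1.0 / ols_slope(y, x)    # X is the dominant axis, flipped back to dy/dx
tls_xy = tls_slope(x, y)
tls_yx = 1.0 / tls_slope(y, x)    # same fit with the axes swapped

print(f"OLS y~x: {y_on_x:.3f}  OLS x~y: {x_on_y:.3f}")
print(f"TLS: {tls_xy:.3f}  TLS with axes swapped: {tls_yx:.3f}")
```

The two OLS lines bracket the orthogonal line, and the orthogonal line is the same either way, which is the sense in which it has no dominant axis.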
Small clarification on what OLS is (was a bit unclear):

The OLS line is the line for which the sum of the squared vertical distances between the line and the datapoints is minimized.

The midpoint bit is an illustration for intuition, not strictly how OLS actually "works"
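
Written out, using notation assumed here rather than taken from the thread ($n$ datapoints $(x_i, y_i)$, slope $m$, intercept $b$), the OLS line solves the first problem below. The orthogonal least squares line solves the second, where dividing by $1 + m^2$ converts each squared vertical distance into a squared perpendicular distance.

```latex
% OLS: squared vertical distances
(\hat{m}, \hat{b})_{\mathrm{OLS}}
  = \arg\min_{m,\,b} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2

% Orthogonal (total) least squares: squared perpendicular distances
(\hat{m}, \hat{b})_{\mathrm{TLS}}
  = \arg\min_{m,\,b} \sum_{i=1}^{n} \frac{\bigl( y_i - (m x_i + b) \bigr)^2}{1 + m^2}
```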
The above thread was inspired by one from @kareem_carr here:

For more thought-provoking stuff on axis dominance and graphical displays, check out @CT_Bergstrom and @jevinwest's "Diamond Plot" idea.

arxiv.org/abs/1809.09328
I've neglected finishing the code for it, but I've been playing with an alternative take on the Diamond Plot: the Rotatogram.

Instead of fixed axes / moving regression line, the (orthogonal) regression line is fixed vertically or horizontally, and the axes rotate around it.
Some followups from the responses:

@cdsamii pointed out that this is apparently the literal cover example of David Freedman's "Statistical Models" text. Maybe one day, I'll have a truly original idea. ¯\_(ツ)_/¯

As many have noted, principal components analysis (PCA) is also based on the idea of orthogonal least squared distances. In this simple example, PCA, orthogonal least squares, total least squares and Deming regression are functionally equivalent and get you that symmetrical line.
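
A quick numerical check of that equivalence, as a sketch on assumed simulated data. The Deming slope uses the standard closed form with the error-variance ratio `delta` set to 1, which is the setting where it coincides with orthogonal regression. Variable names are illustrative.

```python
import numpy as np

# Assumed stand-in data: an elliptical cloud with correlation 0.6.
rng = np.random.default_rng(2)
xy = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
x, y = xy[:, 0], xy[:, 1]

cov = np.cov(x, y)  # 2x2 sample covariance matrix
s_xx, s_xy, s_yy = cov[0, 0], cov[0, 1], cov[1, 1]

# PCA: the first principal component is the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
pc1 = eigvecs[:, -1]
pca_slope = pc1[1] / pc1[0]

# Deming regression with error-variance ratio delta = 1.
delta = 1.0
deming_slope = (
    s_yy - delta * s_xx
    + np.sqrt((s_yy - delta * s_xx) ** 2 + 4 * delta * s_xy ** 2)
) / (2 * s_xy)

# Orthogonal / total least squares via SVD of the centered data.
centered = xy - xy.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
tls_slope = vt[0, 1] / vt[0, 0]

print(f"PCA: {pca_slope:.4f}  Deming (delta=1): {deming_slope:.4f}  TLS: {tls_slope:.4f}")
```

All three slopes should agree up to floating point error. With `delta` different from 1, Deming regression would generally give a different line.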
Most importantly, the main point of this exercise is to understand what question you are asking of the data vs. what question the method you employ is asking of the data.

This doesn't mean PCA or orthogonal regression is "better," it's just asking a different question than OLS.
