Kareem Carr | Data Scientist
May 13 · 24 tweets · 5 min read
Statistics can never be completely objective.

This is not just my opinion. It's a *mathematical* fact.

Read on if you want to learn a deep fundamental truth about data and its relationship to the universe we live in.
[At the end of this thread, you should also understand why robust social science research is fundamental to the correct interpretation of data related to racial disparities.]
SIMPSON'S PARADOX

"Every statistical relationship between two variables X and Y has the potential to be reversed when we include a third variable Z into the analysis."

This is called Simpson's paradox.
A famous example of this is data at the heart of the 1973 UC Berkeley Gender-Bias case.

There seemed to be a clear bias in the admissions rates of men vs women. About 45% of men who applied were admitted compared to only 30% of the women.
But once choice of major was added to the analysis, it was revealed that women were being accepted at higher rates in almost every field.

This seems like a contradiction at first but the solution is rather simple.
A disproportionate share of men applied to the majors that were easier to get into, like Departments A and B.

Approximately 1000 more men than women had applied to those departments alone, and this difference shows up in the aggregated data as a higher admissions rate for men.
To summarize, being male seemed to be associated with an increase in the admissions rate but adding a third factor, choice of major, *reversed* the association.

Being male was actually associated with a decrease in admissions rate.

We call this a "reversal".
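The reversal can be checked with a few lines of arithmetic. The sketch below assumes the widely reproduced counts for the six largest departments (the table usually shown alongside this case); the variable and function names are my own, and the counts should be treated as an assumption rather than as the thread's source data.

```python
# Admitted and total applicants for the six largest departments.
# ASSUMPTION: these are the commonly tabulated six-department counts,
# not figures taken from the thread itself.
# Format: department -> {group: (admitted, applied)}
ADMISSIONS = {
    "A": {"men": (512, 825), "women": (89, 108)},
    "B": {"men": (353, 560), "women": (17, 25)},
    "C": {"men": (120, 325), "women": (202, 593)},
    "D": {"men": (138, 417), "women": (131, 375)},
    "E": {"men": (53, 191), "women": (94, 393)},
    "F": {"men": (22, 373), "women": (24, 341)},
}


def overall_rate(group):
    """Admission rate for a group, ignoring department (the aggregated view)."""
    admitted = sum(dept[group][0] for dept in ADMISSIONS.values())
    applied = sum(dept[group][1] for dept in ADMISSIONS.values())
    return admitted / applied


def department_rate(dept, group):
    """Admission rate for a group within a single department."""
    admitted, applied = ADMISSIONS[dept][group]
    return admitted / applied


# Departments where women were admitted at a higher rate than men.
women_ahead = [
    dept for dept in ADMISSIONS
    if department_rate(dept, "women") > department_rate(dept, "men")
]

print(f"overall: men {overall_rate('men'):.1%}, women {overall_rate('women'):.1%}")
for dept in ADMISSIONS:
    men = department_rate(dept, "men")
    women = department_rate(dept, "women")
    print(f"dept {dept}: men {men:.1%}, women {women:.1%}")
print("women ahead in:", women_ahead)
```

With these counts, the aggregated rates come out to roughly 45% for men and 30% for women, while women have the higher rate in four of the six departments (A, B, D and F). Departments A and B also account for 1,385 male applicants against 133 female applicants, which is where the aggregate gap comes from.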
REVERSALS

The phenomenon of "reversal" should disturb you. It means the conclusion of any data analysis could be completely flipped just by adding an extra variable to the statistical model.
The situation is actually even worse than it first seems.

It is indeed possible to have chains of reversals where *every* new variable added to a data analysis *reverses* the direction of the relationship established in the previous analysis.
"Did being black help Obama become president?"

For most of his life, almost certainly not. Perhaps during the primary debates it helped him stand out. But once he was running, it probably lost him voters that he would otherwise have won over had he been white.
"Do masks reduce covid infections?"

Maybe in general they do. If you use them badly, they don't. And if you always socially distance, perhaps they don't change much either way.
PREDICTION

It might seem at first that this means that statistics is useless as a source of knowledge, but each additional variable actually tends to make our statistical model better.

This might seem like a contradiction but it's not.
Generally speaking, every new variable added to the model will improve the accuracy of its predictions. It is only the interpretation of the relationships between the outcome and input variables that might change.

In cases where interpretation doesn't matter, this can be enough.
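Here is a small simulated illustration of both halves of that claim: the fit improves when a variable is added, yet the sign of an existing coefficient flips. Everything in it is an assumption made for the example (the synthetic data, the coefficients, and the helper name `fit`), and the improvement shown is in fit to the same data the model was estimated on.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data. All coefficients are made up for illustration.
z = rng.normal(size=n)                        # the "third variable"
x = z + 0.5 * rng.normal(size=n)              # x moves closely with z
y = 1.0 * x - 3.0 * z + 0.5 * rng.normal(size=n)


def fit(columns, target):
    """Ordinary least squares with an intercept.

    Returns (coefficients, residual sum of squares); coefficients[0]
    is the intercept and the rest follow the order of `columns`.
    """
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return coef, float(residuals @ residuals)


coef_x, rss_x = fit([x], y)          # model 1: y ~ x
coef_xz, rss_xz = fit([x, z], y)     # model 2: y ~ x + z

print(f"y ~ x     : slope on x = {coef_x[1]:+.2f}, RSS = {rss_x:.0f}")
print(f"y ~ x + z : slope on x = {coef_xz[1]:+.2f}, RSS = {rss_xz:.0f}")
```

In this setup the slope on x is negative when x is the only input and positive once z is included, while the residual sum of squares goes down. The second model predicts better, but the two models tell opposite stories about x.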
CAUSATION

The correct relationship between X and Y can only be found by performing statistical analyses which respect the causal relationship between all the variables in the analysis.

Therefore, causal information is fundamental to the accuracy of statistical analyses.
Let me explain what I mean by "causal information".

Imagine we're trying to quantify the effect of our summer math camp program on students' math grades during the year, and we have information on math camp attendance, final grades, GRE scores and GPAs.
Adding causal information to our analysis means we need to produce a diagram like this one.

For each variable in the analysis, we need to identify all the relevant variables it is influenced by and all the relevant variables it influences.
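One way to write such a diagram down is as a mapping from each variable to the variables it influences. The sketch below is only an illustration: the arrows are hypothetical guesses for the math camp example, not the ones in the thread's diagram, and the helper names are my own.

```python
# HYPOTHETICAL causal arrows for the math camp example (cause -> effects).
# These edges are assumptions for illustration, not the thread's diagram.
INFLUENCES = {
    "gpa": ["camp_attendance", "final_grade", "gre_score"],
    "camp_attendance": ["final_grade"],
    "final_grade": [],
    "gre_score": [],
}


def influenced_by(variable):
    """Variables with an arrow pointing into `variable` (its direct causes)."""
    return {cause for cause, effects in INFLUENCES.items() if variable in effects}


def influences(variable):
    """Variables that `variable` points to (its direct effects)."""
    return set(INFLUENCES[variable])


def shared_direct_causes(a, b):
    """Direct causes common to both variables.

    A simplification: only direct arrows are checked, not longer paths.
    """
    return influenced_by(a) & influenced_by(b)


for name in INFLUENCES:
    print(name, "<-", sorted(influenced_by(name)), "| ->", sorted(influences(name)))

print("shared causes of camp attendance and final grade:",
      sorted(shared_direct_causes("camp_attendance", "final_grade")))
```

Under these assumed arrows, GPA influences both camp attendance and final grades, so a comparison of campers and non-campers that ignores GPA would mix the camp's effect with the effect of GPA.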
Three important sources of causal information are:

1. Commonsense Assumptions. The analyst can make reasonable assumptions and try their best to justify them in their report.
2. Expertise/Scientific knowledge. Having the infrastructure of an entire science behind you can be extremely useful in making the tough decisions about which causal relationships are real.
3. Experimental Design. We can collect data in a way that makes all the causal paths that we don't care about physically impossible, and allows us to isolate the causal path that we do care about.
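The third source, experimental design, can be illustrated with a small simulation of the math camp example. All numbers below are assumptions chosen for the sketch (a true camp effect of 5 grade points, an unobserved "ability" that raises both attendance and grades), and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
TRUE_EFFECT = 5.0  # ASSUMED grade points gained from attending camp

# Unobserved ability affects grades in both scenarios (simulated).
ability = rng.normal(size=n)


def final_grades(attended):
    """Simulated grades: baseline + camp effect + ability + noise."""
    return 70 + TRUE_EFFECT * attended + 10 * ability + rng.normal(size=n)


def difference_in_means(grades, attended):
    """Average grade of attendees minus average grade of non-attendees."""
    return grades[attended].mean() - grades[~attended].mean()


# Observational data: stronger students are more likely to choose camp,
# so ability opens a second path between attendance and grades.
chose_camp = (ability + rng.normal(size=n)) > 0
observational_estimate = difference_in_means(final_grades(chose_camp), chose_camp)

# Randomized experiment: a coin flip decides attendance, so ability
# cannot influence who attends.
assigned_camp = rng.random(n) < 0.5
randomized_estimate = difference_in_means(final_grades(assigned_camp), assigned_camp)

print(f"true effect:            {TRUE_EFFECT:.1f}")
print(f"observational estimate: {observational_estimate:.1f}")
print(f"randomized estimate:    {randomized_estimate:.1f}")
```

In this simulation the observational comparison greatly overstates the camp's effect because attendees were stronger students to begin with, while the randomized comparison lands close to the assumed true effect.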
OBJECTIVITY

None of these ways of getting causal information is infallible. Therefore, no statistical analysis is infallible. All rely on some prior knowledge.
Many statistical analyses do not state what prior information was used, which makes them seem "objective".

This does not mean that none was used.

It only means the audience was robbed of the ability to fully assess the quality of the causal information used and of the analysis.
A statistical analysis can never be completely objective because it lives or dies on the quality of the analyst's prior knowledge.

This is a mathematical constraint on humans, AI or anything else that needs to analyze data.

It is a fundamental property of our universe.
For more information, check out:
- "The Book of Why" by Judea Pearl [Accessible to a General Audience]
- "Understanding Simpson's Paradox" also by Pearl [Short. Technical. Good discussion of Reversals]
- "What If" by Hernán and Robins [Free Online Textbook]
I enjoy explaining math and statistics ideas in a way that regular people can understand.

Follow me for more content like this, and don't forget to click the little notification bell so you don't miss out on future threads.

