Kareem Carr, Statistics Person
Sep 15, 2023
Every Data Scientist needs to know these ideas.

They will blow your mind.

1. Correlation vs Causation

P(A | B) is the probability of A given B. It is the probability that we will observe A given that we have already observed B.

P(A | do(B)) is the probability of A given do(B). It is the probability that we will observe A given that we have intervened to cause B to happen.

In this context, an intervention simply means taking an action of some kind. Therefore, do(B) means taking an action that causes B to happen.

The expressions P(A | B) and P(A | do(B)) might seem very similar but they represent very different situations.
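To make the gap between the two expressions concrete, here is a minimal sketch that computes both exactly for a tiny made-up model. Everything in it is an assumption for illustration and not from the thread: a hidden factor Z influences both A and B, B has no effect on A, and the probability tables are invented.

```python
from fractions import Fraction as F

# Hypothetical model (assumed for illustration): Z -> B and Z -> A, no B -> A link.
P_Z = {1: F(1, 2), 0: F(1, 2)}             # P(Z = z)
P_B_GIVEN_Z = {1: F(9, 10), 0: F(1, 10)}   # P(B = 1 | Z = z)
P_A_GIVEN_Z = {1: F(8, 10), 0: F(2, 10)}   # P(A = 1 | Z = z), independent of B


def p_a_given_b():
    """P(A=1 | B=1): restrict attention to the cases where B was observed."""
    joint_ab = sum(P_Z[z] * P_B_GIVEN_Z[z] * P_A_GIVEN_Z[z] for z in P_Z)
    p_b = sum(P_Z[z] * P_B_GIVEN_Z[z] for z in P_Z)
    return joint_ab / p_b


def p_a_given_do_b():
    """P(A=1 | do(B=1)): force B = 1, so Z no longer decides B."""
    return sum(P_Z[z] * P_A_GIVEN_Z[z] for z in P_Z)


print(p_a_given_b(), p_a_given_do_b())  # prints "37/50 1/2"
```

In this toy model, seeing B raises the probability of A from 1/2 to 37/50 because B is a clue about Z, while forcing B changes nothing about A.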

2. We can only learn P(A | B) from the data alone.

Bob has an extremely accurate weather app and is always very good about bringing his umbrella when it rains. We observe Bob over several years and find that whenever it rains, Bob has his umbrella, and he never brings his umbrella on days when it doesn't rain.

In the language of probability, we say P(Umbrella | Rain) = 1 and P(Rain | Umbrella) = 1 as well.

What we can learn from this data alone is how to predict whether it rains with 100% accuracy by checking whether Bob has an umbrella. We can also learn to predict with 100% accuracy whether Bob has an umbrella by checking whether it's going to rain.
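As a rough sketch of what "learning from the data alone" amounts to, the conditional probabilities are just ratios of counts over the days we observed. The five-day log and the helper name `conditional` below are hypothetical, made up for illustration.

```python
# Hypothetical observation log: one (rained, had_umbrella) pair per day.
days = [(True, True), (False, False), (True, True), (False, False), (False, False)]


def conditional(days, event, given):
    """Estimate P(event | given) as a ratio of counts over the observed days."""
    matching = [d for d in days if given(d)]
    if not matching:
        raise ValueError("the conditioning event was never observed")
    return sum(1 for d in matching if event(d)) / len(matching)


def rain(day):
    return day[0]


def umbrella(day):
    return day[1]


p_umbrella_given_rain = conditional(days, umbrella, rain)
p_rain_given_umbrella = conditional(days, rain, umbrella)
print(p_umbrella_given_rain, p_rain_given_umbrella)  # prints "1.0 1.0"
```

Nothing in these counts says which way the arrow points. The log would look exactly the same if umbrellas caused rain.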

What we cannot learn is what will happen if we give Bob an umbrella on a random day of our choosing. The answer to this question is P(Rain | do(Umbrella)), and it's unknowable from the data alone.

We need prior knowledge about how the world works to properly interpret the data we collected. We need to know that rain has an effect on Bob's behavior, but Bob's behavior has no effect on the rain.
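Once that prior knowledge is written down, the intervention can be simulated. The sketch below assumes the causal story just described (rain decides the umbrella, never the reverse) plus an invented 30% chance of rain. The function names and the `force_umbrella` parameter are hypothetical.

```python
import random


def simulate_day(rng, p_rain=0.3, force_umbrella=None):
    """One day in Bob's world: Rain -> Umbrella, and nothing points back at Rain."""
    rain = rng.random() < p_rain
    # Left alone, Bob follows the weather. Under do(Umbrella) we override his choice.
    umbrella = rain if force_umbrella is None else force_umbrella
    return rain, umbrella


def rain_rate_when_umbrella(days):
    """Fraction of umbrella days on which it rained."""
    rained = [rain for rain, umbrella in days if umbrella]
    return sum(rained) / len(rained)


rng = random.Random(0)
observed = [simulate_day(rng) for _ in range(100_000)]
intervened = [simulate_day(rng, force_umbrella=True) for _ in range(100_000)]

p_obs = rain_rate_when_umbrella(observed)    # estimate of P(Rain | Umbrella)
p_do = rain_rate_when_umbrella(intervened)   # estimate of P(Rain | do(Umbrella))
print(p_obs, round(p_do, 2))
```

In the observed days the rate is exactly 1. When we hand Bob the umbrella ourselves, it falls back to roughly the assumed base rate of rain, because our action tells us nothing about the weather.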

Information about the effects of interventions is simply not available in raw data unless that data was collected through controlled experimental manipulation.

3. Scientific Experiments work because they produce a very special kind of data.

You may have heard of what many people call a scientific experiment. Take a collection of objects, animals or people. Randomly split that collection into a control group and a treatment group. Apply your intervention to the treatment group while leaving the control group alone. Because the split was random, the two groups are comparable on average in every other respect. If you observe any differences between the treatment group and the control group, it is therefore logical to attribute these differences to the treatment. You can say the differences were caused by the treatment.

In statistics, the procedure I just described is called a Randomized Controlled Trial. It is a procedure for generating a specific kind of data where:

P(Difference | Treatment) = P(Difference | do(Treatment))

This is why traditional science experiments work. They are designed to capture causal information. This is not the case for the vast majority of data that we collect in society.
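Here is a small simulation sketch of why randomization matters. All the numbers are invented for illustration: a hidden health factor, sicker people opting into treatment more often when left to choose, and a true treatment effect of +0.10 on the recovery rate. The function name `recovery_gap` is hypothetical.

```python
import random


def recovery_gap(rng, n, randomized):
    """Recovery rate of the treated minus the untreated, over n simulated people."""
    counts = {True: [0, 0], False: [0, 0]}  # treated -> [recovered, total]
    for _ in range(n):
        healthy = rng.random() < 0.5  # hidden background factor
        if randomized:
            treated = rng.random() < 0.5  # coin flip, i.e. do(Treatment)
        else:
            treated = rng.random() < (0.2 if healthy else 0.8)  # sicker people opt in
        p_recover = (0.7 if healthy else 0.3) + (0.1 if treated else 0.0)
        recovered = rng.random() < p_recover
        counts[treated][0] += recovered
        counts[treated][1] += 1
    return counts[True][0] / counts[True][1] - counts[False][0] / counts[False][1]


rng = random.Random(1)
gap_rct = recovery_gap(rng, 200_000, randomized=True)
gap_observational = recovery_gap(rng, 200_000, randomized=False)
print(round(gap_rct, 2), round(gap_observational, 2))
```

Under these assumptions the randomized gap lands near the true +0.10, while the self-selected gap comes out near -0.14, which makes a helpful treatment look harmful. The coin flip is what makes the observed difference equal the interventional one.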

Without human guidance or access to real world knowledge, statistical algorithms and artificial intelligences can only learn P(A | B) from the raw data. This is a fundamental mathematical limitation on the use of data alone.

That's it for now. This post is part of a series of posts about the concept of causal inference. They are based on the content of The Book of Why by Judea Pearl and Dana Mackenzie, with lots of commentary from me.

Follow me (@kareem_carr) so you don't miss out on the next post.

Please show support by liking and retweeting the thread.