P(A | B) is the probability of A given B. It is the probability that we will observe A given that we have already observed B.
P(A | do(B)) is the probability of A given do(B). It is the probability that we will observe A given that we have intervened to cause B to happen.
In this context, an intervention simply means to take an action of some kind. Therefore do(B) means to take an action which causes B to happen.
The expressions P(A | B) and P(A | do(B)) might seem very similar but they represent very different situations.
2. We can only learn P(A | B) from the data alone.
Bob has an extremely accurate weather app and is always very good about bringing his umbrella when it rains. We observe Bob over several years and find that whenever it rains, Bob has his umbrella, and he never brings his umbrella on days when it doesn't rain.
In the language of probability, we say P(Umbrella | Rain) = 1 and P(Rain | Umbrella) = 1 as well.
What we can learn from this data alone is how to predict whether it rains with 100% accuracy by checking whether Bob has an umbrella. We can also learn to predict with 100% accuracy whether Bob has an umbrella by checking if it's going to rain.
What we cannot learn is what will happen if we give Bob an umbrella on a random day of our choosing. The answer to this question is P(Rain | do(Umbrella)) and it's unknowable from the data alone.
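To make the two conditional probabilities concrete, here is a minimal sketch that simulates watching Bob and estimates them as simple frequencies. The 30% chance of rain, the 1,000 days, and the helper names `observe_bob` and `conditional` are all assumptions made up for illustration; none of them come from real data.

```python
import random

def observe_bob(days, p_rain=0.3, seed=0):
    """Simulate passively watching Bob. p_rain is an assumed, made-up rain rate."""
    rng = random.Random(seed)
    log = []
    for _ in range(days):
        rain = rng.random() < p_rain
        umbrella = rain  # Bob's app is perfect, so his umbrella mirrors the weather
        log.append((rain, umbrella))
    return log

def conditional(log, event, given):
    """Estimate P(event | given) as a frequency among the days where `given` holds."""
    matching = [day for day in log if given(day)]
    return sum(event(day) for day in matching) / len(matching)

log = observe_bob(days=1000)
p_umbrella_given_rain = conditional(log, event=lambda d: d[1], given=lambda d: d[0])
p_rain_given_umbrella = conditional(log, event=lambda d: d[0], given=lambda d: d[1])
print(p_umbrella_given_rain, p_rain_given_umbrella)  # 1.0 1.0
```

Both estimates come out as 1 because, in this log, rain and umbrella are the same column. Nothing in the log itself says which one is driving the other.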
We need prior knowledge about how the world works to properly interpret the data we collected. We need to know that rain has an effect on Bob's behavior, but Bob's behavior has no effect on the rain.
Information about the effects of interventions is simply not available in raw data unless it is collected by controlled experimental manipulation.
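Here is a small sketch of the gap between observing and intervening in the Bob example. It only works because the simulator hard-codes the prior knowledge that rain affects Bob and Bob does not affect rain; that is exactly the information the raw data cannot supply. The 30% rain rate, the `simulate` function, and its `force_umbrella` parameter are assumptions for illustration, and for simplicity the intervention hands Bob an umbrella every day.

```python
import random

P_RAIN = 0.3  # assumed base rate of rain, chosen only for illustration

def simulate(days, force_umbrella=None, seed=1):
    """Return a list of (rain, umbrella) pairs.

    force_umbrella=None -> passive observation: Bob follows his weather app.
    force_umbrella=True -> do(Umbrella): we hand Bob an umbrella regardless of weather.
    """
    rng = random.Random(seed)
    days_log = []
    for _ in range(days):
        rain = rng.random() < P_RAIN  # the weather ignores Bob entirely
        umbrella = rain if force_umbrella is None else force_umbrella
        days_log.append((rain, umbrella))
    return days_log

def rain_rate_when_carrying(days_log):
    """Fraction of umbrella days on which it rained."""
    rainy = [rain for rain, umbrella in days_log if umbrella]
    return sum(rainy) / len(rainy)

observed = rain_rate_when_carrying(simulate(10_000))
intervened = rain_rate_when_carrying(simulate(10_000, force_umbrella=True))
print(f"P(Rain | Umbrella)     ~ {observed:.2f}")
print(f"P(Rain | do(Umbrella)) ~ {intervened:.2f}")
```

Under these assumptions the observed value is exactly 1, while the intervened value lands near the base rate of rain. If the arrow were reversed in the simulator, the same observational log would go with a very different answer to the do() question.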
3. Scientific experiments work because they produce a very special kind of data.
You may have heard of what many people call a scientific experiment. Take a collection of objects, animals or people. Randomly split that collection into a control group and a treatment group. Apply your intervention to the treatment group while leaving the control group alone. If you observe any differences between the treatment group and the control group, it is logical to attribute these differences to the treatment. You can therefore say the differences were caused by the treatment.
In statistics, the procedure I just described is called a Randomized Controlled Trial. It is a procedure for generating a specific kind of data where P(A | B) = P(A | do(B)).
This is why traditional science experiments work. They are designed to capture causal information. This is not the case for the vast majority of data that we collect in society.
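The following sketch compares data from a randomized experiment with data where people choose their own treatment. Everything in it is hypothetical: the hidden "healthy" trait, the recovery probabilities, the 10-point treatment benefit, and the `run_study` function are assumptions invented to illustrate the idea, not results from any real study.

```python
import random

TRUE_EFFECT = 0.10  # assumed: treatment raises the chance of recovery by 10 points

def mean(values):
    return sum(values) / len(values)

def run_study(n, randomized, seed=2):
    """Return recovery rate of treated minus recovery rate of untreated."""
    rng = random.Random(seed)
    outcomes = {True: [], False: []}
    for _ in range(n):
        healthy = rng.random() < 0.5  # hidden trait that also affects recovery
        if randomized:
            treated = rng.random() < 0.5  # coin flip, unrelated to health
        else:
            # self-selection: sicker people are more likely to seek treatment
            treated = rng.random() < (0.2 if healthy else 0.8)
        p_recover = (0.7 if healthy else 0.3) + (TRUE_EFFECT if treated else 0.0)
        outcomes[treated].append(rng.random() < p_recover)
    return mean(outcomes[True]) - mean(outcomes[False])

observational = run_study(100_000, randomized=False)
rct = run_study(100_000, randomized=True)
print(f"observational difference: {observational:+.2f}")
print(f"randomized difference:    {rct:+.2f}")
```

With these made-up numbers the self-selected data makes the treatment look harmful (a difference around -0.14), even though it helps by construction. The coin flip breaks the link between health and treatment, so the randomized difference comes out close to the true +0.10.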
Without human guidance or access to real world knowledge, statistical algorithms and artificial intelligences can only learn P(A | B) from the raw data. This is a fundamental mathematical limitation on the use of data alone.
That's it for now. This post is part of a series of posts about the concept of causal inference. They are based on the content of The Book of Why by Judea Pearl with lots of commentary from me.
Follow me (@kareem_carr) so you don't miss out on the next post.
Please show support by liking and retweeting the thread.