
🔥 Dr Kareem Carr 🔥

@kareem_carr

Sep 15, 2023

Every Data Scientist needs to know these ideas.

They will blow your mind.

1. Correlation vs Causation

P(A | B) is the probability of A given B. It is the probability that we will observe A given that we have already observed B.

P(A | do(B)) is the probability of A given do(B). It is the probability that we will observe A given that we have intervened to cause B to happen.

In this context, an intervention simply means to take an action of some kind. Therefore do(B) means to take an action which causes B to happen.

The expressions P(A | B) and P(A | do(B)) might seem very similar but they represent very different situations.

2. From the data alone, we can only learn P(A | B).

Bob has an extremely accurate weather app and is always very good about bringing his umbrella when it rains. We observe Bob over several years and find that whenever it rains, Bob has his umbrella, and he never brings his umbrella on days when it doesn't rain.

In the language of probability, we say P(Umbrella | Rain) = 1 and P(Rain | Umbrella) = 1 as well.
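As a rough illustration of how these two numbers come out of observed data, here is a minimal Python sketch. The day counts and the helper `conditional_probability` are made up for this example; only the pattern (umbrella exactly on rainy days) comes from the story above.

```python
# Hypothetical observation log: one (rain, umbrella) pair per day.
# The counts are invented for illustration; only the pattern matters.
days = [(True, True)] * 120 + [(False, False)] * 245


def conditional_probability(records, event, given):
    """Estimate P(event | given) as a ratio of counts."""
    matching = [r for r in records if given(r)]
    if not matching:
        raise ValueError("conditioning event never occurs in the records")
    return sum(1 for r in matching if event(r)) / len(matching)


def rained(day):
    return day[0]


def had_umbrella(day):
    return day[1]


p_umbrella_given_rain = conditional_probability(days, had_umbrella, rained)
p_rain_given_umbrella = conditional_probability(days, rained, had_umbrella)
print(p_umbrella_given_rain, p_rain_given_umbrella)
```

Both estimates are plain counting over what was observed. Nothing in the calculation says which variable influences the other.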

What we can learn from this data alone is how to predict whether it rains with 100% accuracy by checking whether Bob has an umbrella. We can also learn to predict with 100% accuracy whether Bob has an umbrella by checking whether it's going to rain.

What we cannot learn is what will happen if we give Bob an umbrella on a random day of our choosing. The answer to this question is P(Rain | do(Umbrella)) and it's unknowable from the data alone.

We need prior knowledge about how the world works to properly interpret the data we collected. We need to know that rain has an effect on Bob's behavior, but Bob's behavior has no effect on the rain.
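One way to see the gap between the two quantities is a small simulation. The sketch below is only an illustration: the 30% chance of rain and the function names are assumptions, and the causal direction (rain drives the umbrella, never the reverse) is written into the code by hand. That hand-written structure is exactly the prior knowledge the data cannot supply.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
P_RAIN = 0.3            # assumed base rate of rain, not from the thread


def observe_day():
    """Observational world: rain causes Bob to carry the umbrella."""
    rain = rng.random() < P_RAIN
    umbrella = rain
    return rain, umbrella


def intervene_day():
    """Interventional world: we hand Bob an umbrella no matter what."""
    rain = rng.random() < P_RAIN  # the weather ignores our action
    umbrella = True
    return rain, umbrella


def rain_rate_given_umbrella(days):
    """Fraction of umbrella days on which it rained."""
    with_umbrella = [rain for rain, umbrella in days if umbrella]
    return sum(with_umbrella) / len(with_umbrella)


observed = [observe_day() for _ in range(100_000)]
intervened = [intervene_day() for _ in range(100_000)]

p_obs = rain_rate_given_umbrella(observed)   # estimates P(Rain | Umbrella)
p_do = rain_rate_given_umbrella(intervened)  # estimates P(Rain | do(Umbrella))
print(f"P(Rain | Umbrella)     ~ {p_obs:.3f}")
print(f"P(Rain | do(Umbrella)) ~ {p_do:.3f}")
```

Under these assumptions the observational estimate is 1, while the interventional estimate falls back to roughly the base rate of rain, because handing Bob an umbrella does nothing to the weather.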

Information about the effects of interventions is simply not available in raw data unless that data was collected through controlled experimental manipulation.

3. Scientific Experiments work because they produce a very special kind of data.

You may have heard of what many people call a scientific experiment. Take a collection of objects, animals or people. Randomly split that collection into a control group and a treatment group. Apply your intervention to the treatment group while leaving the control group alone. If you observe any differences between the treatment group and the control group, it is logical to attribute these differences to the treatment. You can therefore say the differences were caused by the treatment.

In statistics, the procedure I just described is called a Randomized Controlled Trial. It is a procedure for generating a specific kind of data where:

P(Difference | Treatment) = P(Difference | do(Treatment))

This is why traditional science experiments work. They are designed to capture causal information. This is not the case for the vast majority of data that we collect in society.
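To make the contrast concrete, here is a hedged simulation sketch comparing self-selected data with randomized data. Everything in it is assumed for illustration: a hidden "healthy" trait, the recovery probabilities, and a true treatment effect of 0.2. None of these numbers come from the thread.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable
N = 200_000


def recovers(healthy, treated):
    """Assumed outcome model: baseline depends on health, treatment adds 0.2."""
    p = (0.6 if healthy else 0.2) + (0.2 if treated else 0.0)
    return rng.random() < p


def simulate(assign):
    """Return (treated, recovered) pairs under a given assignment rule."""
    rows = []
    for _ in range(N):
        healthy = rng.random() < 0.5
        treated = assign(healthy)
        rows.append((treated, recovers(healthy, treated)))
    return rows


def difference(rows):
    """Recovery rate among the treated minus the rate among the untreated."""
    treated = [r for t, r in rows if t]
    control = [r for t, r in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Observational data: healthier people opt in to treatment more often.
observational = simulate(lambda healthy: rng.random() < (0.8 if healthy else 0.2))
# Randomized controlled trial: a fair coin decides, ignoring health.
randomized = simulate(lambda healthy: rng.random() < 0.5)

obs_diff = difference(observational)
rct_diff = difference(randomized)
print(f"observed difference, self-selected: {obs_diff:.3f}")
print(f"observed difference, randomized:    {rct_diff:.3f}")
```

With these assumed numbers, the randomized difference lands near the true effect of 0.2, while the self-selected difference is inflated to roughly 0.44 because health influences both who gets treated and who recovers. Randomization cuts that link, which is why the observed difference matches the effect of the intervention.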

Without human guidance or access to real world knowledge, statistical algorithms and artificial intelligences can only learn P(A | B) from the raw data. This is a fundamental mathematical limitation on the use of data alone.

That's it for now. This post is part of a series about causal inference. The posts are based on The Book of Why by Judea Pearl, with lots of commentary from me.

Follow me (@kareem_carr) so you don't miss out on the next post.

Please show support by liking and retweeting the thread.
