WHY do we divide by n-1 when computing the sample variance?
I've never seen this way of explaining this concept anywhere else.
Read on if you want a completely new way of looking at this.
BACKGROUND
This explanation is going to be confusing if you're rusty on summation notation. So here is a quick review.
If you're comfortable with this concept, skip to the next tweet.
Summation notation is a compact way of writing the sum of n values.
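In symbols (a plain-text version of what the original image presumably showed):

Σ_{i=1}^{n} x_i = x_1 + x_2 + ... + x_n

The Σ means "add up", and the index i runs over our n observations x_1 through x_n.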
We should also quickly review the "sample mean" or "sample average".
If you are comfortable with this concept, skip ahead to the next tweet.
We compute the sample mean by adding up all our observations and then dividing by the total number of observations.
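In symbols (again reconstructing the likely content of the original image), the sample mean of observations x_1, ..., x_n is:

x̄ = (1/n) Σ_{i=1}^{n} x_i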
Here are two key insights which will be important later.
INSIGHT 1: Notice that in the formula for the sample variance, we are subtracting the sample mean from each observation.
INSIGHT 2: We can think of the sample variance as computing the average squared distance to the sample mean, but with an extra correction factor.
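For reference, the standard sample variance formula (presumably what the original image showed) is:

s² = (1/(n-1)) Σ_{i=1}^{n} (x_i - x̄)²

and INSIGHT 2 comes from rewriting it as a correction factor times a plain average:

s² = (n/(n-1)) · (1/n) Σ_{i=1}^{n} (x_i - x̄)²

The second factor is the average squared distance to the sample mean; the first factor, n/(n-1), is the extra correction.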
Our question then changes from "Why divide by n-1?" to "Where did the correction factor come from?"
IDEA: THE SAMPLE MEAN IS NOT INDEPENDENT OF OUR OBSERVATIONS
Each observation and the sample mean are slightly correlated because the sample mean is computed using all the observations.
The way I like to think about it: the sample mean contains 1/n of every observation, so when we subtract the sample mean, we subtract 1/n of the observation itself. We do this once for each of the n observations, and n times 1/n equals 1. We are effectively subtracting 1 observation's worth of information, which is why we effectively have n-1 observations.
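To see the 1/n concretely, expand one deviation (standard algebra, not from the original tweets):

x_i - x̄ = x_i - (1/n)(x_1 + ... + x_n) = (1 - 1/n)·x_i - (1/n)·(sum of the other observations)

Each deviation keeps only (n-1)/n of its own observation; the missing 1/n is what gets used up by the sample mean.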
This way of thinking about it is not mathematically rigorous, but we can make it so.
What if we try to decorrelate the sample mean and each observation?
IDEA: DECORRELATING THE VALUES WITH ALGEBRA
I will use the first observation as an example.
STEP 1: We rearrange the terms so the mean no longer contains the first observation.
STEP 2: We rearrange the remaining expression to involve the average of the remaining n-1 values.
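Written out for the first observation (my reconstruction of the algebra the original images likely showed):

STEP 1: x_1 - x̄ = x_1 - (1/n)·x_1 - (1/n)(x_2 + ... + x_n) = ((n-1)/n)·x_1 - (1/n)(x_2 + ... + x_n)

STEP 2: factoring out (n-1)/n gives

x_1 - x̄ = ((n-1)/n) · (x_1 - m)

where m = (1/(n-1))(x_2 + ... + x_n) is the average of the remaining n-1 values (m is just my label for it).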
The decorrelation procedure makes intuitive sense.
The average of the n-1 remaining values is uncorrelated with the first observation, and since it's just a sample containing n-1 values, it's also a reasonable estimate of the average of the total population.
IDEA: WE DON'T NEED TO ACTUALLY DECORRELATE. WE CAN JUST USE A CORRECTION FACTOR
Subtracting the sample mean from the first observation is identical to subtracting the average of the remaining n-1 values, multiplied by an extra correlation factor.
The same applies to every observation, not just the first.
As you can imagine, recomputing the average of the n-1 remaining observations for each observation is tedious. It's much easier to subtract the same sample mean each time and then account for the correlation afterwards.
IDEA: BESSEL'S CORRECTION CANCELS THE CORRELATION FACTOR
Notice that the correlation factor and Bessel's correction cancel each other out when multiplied.
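In symbols, using the correlation factor (n-1)/n from the algebra above:

(n/(n-1)) · ((n-1)/n) = 1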
So that's the story of where Bessel's correction comes from and why we divide by n-1.
This isn't the whole story. There is one more twist of mathematical luck that makes the algebra work out.
But this is the main idea.
I hope this makes the appearance of n-1 feel less mysterious.
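If you want to see Bessel's correction at work numerically, here's a short simulation sketch (my addition, not from the original thread): draw many small samples from a population whose variance we know, then compare the divide-by-n estimate with the divide-by-(n-1) estimate.

```python
import random

random.seed(0)
n = 5             # sample size
trials = 100_000  # number of repeated samples
sigma2 = 4.0      # true population variance (std dev = 2)

biased_sum = 0.0    # running total of divide-by-n estimates
unbiased_sum = 0.0  # running total of divide-by-(n-1) estimates

for _ in range(trials):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    biased_sum += ss / n
    unbiased_sum += ss / (n - 1)

print(f"true variance: {sigma2}")
print(f"divide by n:   {biased_sum / trials:.3f}")    # ~3.2, biased low
print(f"divide by n-1: {unbiased_sum / trials:.3f}")  # ~4.0, unbiased
```

With these settings the divide-by-n average comes out near 3.2 (the theoretical bias is a factor of (n-1)/n = 0.8), while the divide-by-(n-1) average lands near the true 4.0.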
I enjoy explaining math and statistics ideas. Follow me for more content like this, and don't forget to click the little notification bell so you don't miss out on future threads.
• • •
I have a crazy solution to profiling that I think just might work.
The police should have to compensate people for searching them. This would incentivize the police to be more thoughtful about their searches, and it would compensate groups of people who get searched repeatedly.
I would pay out the same whether the police find something or not, so there's less of an incentive to make false claims.
Searching people is a cost we impose on the minority for the benefit of the majority. So I think this is actually a pretty fair solution.
"What if people try to look shady to get a pay out?"
Hard to know how people will react to new systems. Maybe this won't happen much, and it's just the cost of doing business. Maybe police will learn to base their searches on more objective criteria than people looking shady.
• • •

Statistician here. I see some rookie data science mistakes, so let's get into it.
MISTAKE: Interpreting association as causation
Tim Pool implies that being a Democrat is the cause of low fertility rates.
This is not supported by the data shown. The plot itself says the Trump vote is only "associated" with higher fertility rates.
MISTAKE: Unwarranted claims about causal mechanisms
Pool asserts that the cause is abortion, but there are likely many variables that differ between Trump and Biden counties, like college attendance rates and access to birth control.
This is not just my opinion. It's a *mathematical* fact.
Read on if you want to learn a deep fundamental truth about data and its relationship to the universe we live in.
[At the end of this thread, you should also understand why robust social science research is fundamental to the correct interpretation of data related to racial disparities.]
SIMPSON'S PARADOX
"Every statistical relationship between two variables X and Y has the potential to be reversed when we include a third variable Z into the analysis."
• • •

I know a lot of you wanted a technical breakdown of this meme, so here it is!
I don't think you will find this level of detail anywhere else so keep reading if you don't want to miss out.
MISLEADING FORMAT:
The first thing I did was recreate the bar chart. I wanted to make sure that my calculations matched theirs, since they seem to have modified the data reported in the original source.
The original table reported percentages, and those seem to have been used to reverse-engineer the counts in the bar chart.