Stat Shepherd · Oct 5 · 6 tweets
Data Literacy Basics - Part 1
Below are five foundational concepts that EVERYONE should understand (in no particular order). Also, let me know what you would add.

1. Outliers rarely disprove trends.
I see this a lot. When presented with a statistic, people will often try to discredit it by bringing up edge cases, or outliers. The reality is that data, in general, has natural variation, even within a clear distribution or trend.

We all know this intuitively. If I were to say "The average height for an American male is about 5 feet 9 inches," and my friend chimed in with "That can't be true! My uncle is 6 feet 8 inches," you surely wouldn't agree that a single data point disproves the statistic. That's an easy example because we're all familiar with people's heights, but for data we aren't accustomed to, this becomes very important to keep in mind.
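To make that concrete, here's a tiny Python sketch (simulated heights, not real survey data) showing how little one extreme value moves the summary of a decent-sized sample:

```python
# Simulated example: one 6'8" outlier barely budges the mean of ~1,000 heights.
import random
import statistics

random.seed(0)
heights = [random.gauss(69, 3) for _ in range(1000)]  # assumed: mean 69 in, sd 3 in
print(round(statistics.mean(heights), 1))  # ~69.0

heights.append(80)  # add the 6'8" uncle
print(round(statistics.mean(heights), 1))  # still ~69.0
```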
🧵 (1/6)
2. Correlation does not imply causation
I’m sure we’ve all heard this ~1,000 times, but for good reason. When variables, data points, trends, distributions, etc. are related or move together, that doesn’t necessarily mean one is causing a direct change in the other(s). Causal analysis is difficult in general. There might be other variables not accounted for (called confounding variables) that explain the correlation.

Textbook example: when ice cream sales increase, drowning incidents also tend to increase. That doesn’t mean eating ice cream causes drowning, or vice versa. The real reason for the correlation is that both ice cream sales and drownings rise during the summer, and warmer weather is the underlying cause of each.

Additionally, a correlation can be pure coincidence made to look strong through visualization, like the famous "correlation" between margarine consumption and the divorce rate in Maine.
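If you want to see a confounder in action, here's a small simulation (all numbers invented): temperature drives both series, so they end up strongly correlated with no causal link between them.

```python
# Invented data: daily temperature drives both ice cream sales and
# drowning counts, so the two correlate despite no causal link.
# Note: statistics.correlation requires Python 3.10+.
import random
import statistics

random.seed(1)
temps = [random.uniform(30, 95) for _ in range(365)]           # daily temp (°F)
ice_cream = [10 + 2 * t + random.gauss(0, 15) for t in temps]  # daily sales
drownings = [0.05 * t + random.gauss(0, 1.0) for t in temps]   # daily incidents

print(round(statistics.correlation(ice_cream, drownings), 2))  # strongly positive
```

Compare only days with similar temperatures and most of that correlation disappears, which is exactly what "confounded by temperature" means.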
(2/6)
3. Per capita
Another one I see omitted frequently. Adjusting a number to be "per capita" means normalizing the metric by the number of individuals in the group. This lets you compare groups without the comparison being dominated by differences in group size.

For example, if we want to understand GDP differences between two countries, the raw totals may be more a function of population size than anything else. Dividing each country's GDP by its population (i.e., GDP per capita) is usually a fairer comparison.
When in doubt, focus on per capita.
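Here's the idea in a few lines of Python, with made-up GDP and population figures (not real countries):

```python
# Made-up numbers: Country A has 4x the total GDP but far lower GDP per capita.
countries = {
    "Country A": {"gdp": 2_000_000_000_000, "pop": 400_000_000},
    "Country B": {"gdp": 500_000_000_000, "pop": 10_000_000},
}
for name, c in countries.items():
    print(name, f"${c['gdp'] / c['pop']:,.0f} per capita")
# Country A: $5,000 per capita
# Country B: $50,000 per capita
```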
(3/6)
4. Means vs Medians
Both are used for the same goal: understanding what a "typical" value in a dataset looks like. However, the calculations are very different, even though I often hear the terms used interchangeably.

The mean is simply the average value of the dataset: sum everything up and divide by the number of data points (we're sticking with the arithmetic mean here). The big downside of the mean is that it's heavily influenced by extreme outliers.

The median is simply the middle value of the dataset when ordered, so it avoids that outlier influence. If your data is relatively "normal" (balanced-looking), either works well. If your data is "skewed" (unbalanced-looking), the median (or maybe even the mode) might be a better representation of a typical value. A quick toy example with invented incomes follows.
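Here the one huge income drags the mean far above anything "typical," while the median stays put:

```python
# Toy skewed data (invented incomes): one extreme value distorts the mean.
import statistics

incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 5_000_000]
print(f"{statistics.mean(incomes):,.0f}")    # 866,667 -- typical of no one
print(f"{statistics.median(incomes):,.0f}")  # 42,500 -- much more representative
```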
(4/6)
5. Sample size matters, but not as much as you might think
Interestingly, this last one usually trips up people with some data literacy more than those starting from zero. One of the go-to objections to a study is "what was the sample size?", and if that's the question you're reaching for, it's usually not the thing to worry about. The reality is that you can make very accurate inferences about a large group (called a population) from a relatively small sample. Sample size hits diminishing returns very quickly. There's a lot of fun math that we stats nerds use to show how and why, but that's beyond the scope of this thread.

What is far more important than sample size is a good, representative sampling method. I could write a whole thread on this (there are entire textbooks and courses on the topic), but just know that with proper sampling methods and study design, you can easily infer statistics about millions of people from a sample of a couple thousand.
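To give a feel for the diminishing returns, here's a standard back-of-the-envelope calculation (assuming a simple random sample of a yes/no question): the margin of error shrinks like 1/sqrt(n), and the population size doesn't even appear in the formula.

```python
# ~95% margin of error for a sample proportion: 1.96 * sqrt(p(1-p)/n).
# Note: the population size is nowhere in this formula.
import math

p = 0.5  # worst-case proportion for a yes/no question
for n in [100, 1_000, 10_000, 100_000]:
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"n={n:>7,}: ±{margin:.1%}")
# n=    100: ±9.8%
# n=  1,000: ±3.1%
# n= 10,000: ±1.0%
# n=100,000: ±0.3%
```

Going from 1,000 to 100,000 respondents only cuts the margin from about ±3% to ±0.3%, and all of that assumes the sample is representative in the first place.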
(5/6)
These were just five basic ideas off the top of my head; there are more to cover in future posts. Let me know what you would add or expand on. If there's interest, I might occasionally dive into more intermediate topics (hypothesis testing, regression analysis, model validation, etc.).
(6/6)
