Stat Shepherd

@stat_sherpa

Oct 5, 2024 • 6 tweets • 5 min read

Data Literacy Basics - Part 1
Below are five foundational concepts that EVERYONE should understand (in no particular order). Also, let me know what you would add.

1. Outliers rarely disprove trends.
I see this a lot. People, when presented with a statistic, will often try to discredit it by bringing up edge cases, or outliers. The reality is that data, in general, has natural variation, even within a distribution or trend.

We all know this. If I were to say “The average height for an American male is about 5 feet 9 inches,” but my friend chimed in with “That can’t be true! My uncle is 6 feet 8 inches,” you surely wouldn’t agree that single data point disproves my statistic. That's an easy example, as we are all familiar with people's heights, but for data we aren’t accustomed to, this becomes very important to keep in mind.
🧵 (1/6)
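To put numbers on the height example, here is a minimal Python sketch. The 20 heights are invented for illustration (they are not survey data) and were chosen so the average lands on 69 inches (5'9"). The point is only that one 6'8" uncle nudges the average rather than overturning it.

```python
import statistics

# Hypothetical heights in inches for 20 men (made-up values, average = 69 in).
heights_in = [65, 66, 67, 67, 68, 68, 68, 69, 69, 69,
              69, 69, 70, 70, 70, 70, 71, 71, 72, 72]

mean_before = statistics.mean(heights_in)

# Add the 6 ft 8 in uncle as a single extra data point.
uncle_in = 6 * 12 + 8
mean_after = statistics.mean(heights_in + [uncle_in])

shift = mean_after - mean_before
print(f"Average without uncle: {mean_before:.2f} in")
print(f"Average with uncle:    {mean_after:.2f} in")
print(f"Shift from one outlier: {shift:.2f} in")
```

With only 20 people the outlier moves the average by about half an inch; with thousands of people the shift would be far smaller.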

2. Correlation does not imply causation
I’m sure we’ve all heard this ~1000 times, but for good reason. When you see variables, data points, trends, distributions, etc. that are related or move together, this doesn’t necessarily mean one is causing a direct change in the other(s). In general, causal analysis is difficult. There might be other variables not accounted for (called confounding variables) explaining the correlation.

Textbook example: When ice cream sales increase, drowning incidents also tend to increase. However, this does not mean that eating ice cream causes drowning or vice-versa. The real reason for this correlation is that both ice cream sales and drownings increase during the summer, when warmer weather is the underlying cause of both.

Additionally, a correlation could be a coincidence made to look strong through visualization, like the correlation between the consumption of margarine and the divorce rate in Maine.
(2/6)
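The ice cream example can be simulated in a few lines of Python. This is a toy sketch, not real data: the temperature range, the slopes, and the noise levels are all assumptions picked for illustration, and "drownings" is just an index number. Temperature drives both series, and neither series affects the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing its straight-line relation to xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures (°C); the confounding variable.
temps = [rng.uniform(0, 35) for _ in range(2000)]
# Both series depend on temperature plus independent noise.
ice_cream = [10 * t + rng.gauss(0, 30) for t in temps]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]

raw_r = pearson(ice_cream, drownings)
adjusted_r = pearson(residuals(ice_cream, temps), residuals(drownings, temps))

print(f"Correlation, ignoring temperature:     {raw_r:.2f}")
print(f"Correlation, after removing its effect: {adjusted_r:.2f}")
```

The raw correlation is strong even though there is no causal link in the simulation. Once the effect of temperature is removed from both series, the leftover correlation is close to zero.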

3. Per capita
Another one I see omitted frequently. Adjusting your numbers to be “per capita” is normalizing your metric to be averaged across individuals. This often allows you to compare averages without worrying much about differences in the number of individuals in the groups.

For example, if we want to understand GDP differences between two countries, just looking at the totals for each may be more of a function of population size than anything else. Dividing each country's GDP by its population (i.e. GDP per capita) is usually a better comparison.
When in doubt, focus on per capita.
(3/6)
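A quick sketch of the GDP comparison in Python. Both countries and all their figures are hypothetical, chosen so the ranking flips once you divide by population.

```python
# Hypothetical countries with made-up totals (not real figures).
countries = {
    "Country A": {"gdp_usd": 2_000_000_000_000, "population": 200_000_000},
    "Country B": {"gdp_usd": 500_000_000_000, "population": 10_000_000},
}

# Per capita = total divided by the number of people.
gdp_per_capita = {
    name: data["gdp_usd"] / data["population"]
    for name, data in countries.items()
}

for name, data in countries.items():
    print(f"{name}: total GDP ${data['gdp_usd']:,}, "
          f"per capita ${gdp_per_capita[name]:,.0f}")
```

Country A has four times the total GDP, yet Country B comes out five times higher per person ($50,000 vs. $10,000).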

4. Means vs Medians
Both are usually used for the same goal: understanding what a "typical" value in a dataset might look like. However, the calculations are very different even though I hear them used interchangeably.

The mean is simply the average value of the dataset. Sum everything up and divide by the number of data points (we’re just sticking with the arithmetic mean here). The big downside of a mean is that it’s heavily influenced by extreme outliers.

The median is simply the middle value of the dataset when ordered, so it largely avoids the outlier influence. If your data is relatively “normal” (balanced looking), either will work well. If your data is “skewed” (unbalanced looking), medians (or maybe even modes) might be a better representation of a typical value.
(4/6)
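Here is a small Python illustration of the difference, using Python's built-in `statistics` module. The six incomes are invented, with one extreme value thrown in to skew the data.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars; the last is an extreme outlier.
incomes_k = [30, 35, 40, 45, 50, 1000]

mean_income = statistics.mean(incomes_k)      # sum divided by count
median_income = statistics.median(incomes_k)  # middle of the sorted values

print(f"Mean:   {mean_income}k")    # 200k, pulled up by the single outlier
print(f"Median: {median_income}k")  # 42.5k, close to what most people earn
```

Five of the six people earn 50k or less, so the median is the more honest "typical" value for this skewed set.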

5. Sample size matters, but not as much as you might think
Interestingly, this last one usually trips up people with some data literacy more than those starting from zero. One of the go-to questions for a study is “what was the sample size?” and if you’re asking that, you likely shouldn’t be worried about it. The reality is that you can make very accurate inferences about a large group (called a population) with a relatively small sample. Sample sizes hit diminishing returns very quickly. There’s a lot of fun math as to how and why this is the case that we stats nerds use, but that’s beyond the scope of this.

What is far more important than sample size is good, representative sampling methods. I could write a whole thread on this (there are entire textbooks and courses on this topic), but just know that with proper sampling methods and study design, you can easily infer statistics about millions with a sample of a couple thousand.
(5/6)
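One way to see the diminishing returns is the textbook margin-of-error formula for a proportion at 95% confidence, z * sqrt(p(1 - p) / n). The sketch below assumes a simple random sample drawn from a population much larger than the sample, and uses p = 0.5 as the worst case; the function name and sample sizes are just for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 2_000, 10_000):
    print(f"n = {n:>6}: about ±{margin_of_error(n) * 100:.1f} percentage points")
```

Notice that population size does not appear in the formula at all. Going from 100 to 1,000 respondents cuts the margin from roughly ±10 points to roughly ±3, but going from 2,000 to 10,000 only gains about one more point. None of this helps if the sample is not representative.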

These were just 5 basic ideas off the top of my head. There are more to cover in future posts. Let me know what you would add or expand on. In the future I might dive into more intermediate topics (hypothesis testing, regression analysis, model validation, etc.) occasionally if there’s interest.
(6/6)

More from @stat_sherpa

Feb 23

Data Literacy – Part 3: Hypothesis Testing

HOW THE HELL DO DATA NERDS ANSWER QUESTIONS WITH DATA???

Data nerds tend to say stuff like, “The data show a statistically significant difference…” but what does that mean, and how do they decide that?

Below we’ll go through a quick example of how a statistician might answer a question about age differences between Democrats and Republicans (with real data).
🧵(1/7)

Suppose you wanted to test if there is a difference in age between Democrats and Republicans. There are only two possible answers to this initial question:

1. There IS NOT a difference in age between Democrats and Republicans.
2. There IS a difference in age between Democrats and Republicans.

It might seem basic, but in the world of data, your question can have a large effect on the type of analysis you do. For example, the following three questions may require different methods or tests to answer:

1. “On average, is there a difference in age between Democrats and Republicans?”
2. “Are Democrats typically younger than Republicans?”
3. “Do Republicans tend to be older than Democrats?”

For this we’ll focus on the first question and the two possible outcomes (our hypotheses) using data from the 2018 General Social Survey.

One way you might go about determining if there is or is not a significant difference in age is to compare the averages, which is a good start. Here’s the data:

Avg. age Democrats: 44.2 years
Avg. age Republicans: 47.5 years

It looks like Republicans are a bit older by about 3.3 years, but take a second to think about how close these averages would have to be for you to be skeptical about whether a difference exists. In other words, is 3.3 years a “real” difference, or is that just “luck of the draw” based on who answered the survey? If we were to sample different Republicans and Democrats, maybe the difference in age would go the other way!
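The excerpt stops before showing how the question gets answered, so the following is not the author's analysis. It is a sketch of one common way to tackle the "luck of the draw" question, a permutation test, run on synthetic ages. The group sizes (300 each) and the spread (17 years) are assumptions; only the two averages come from the text, and they are used here purely as simulation settings.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in ages, NOT the General Social Survey data.
dem_ages = [rng.gauss(44.2, 17) for _ in range(300)]
rep_ages = [rng.gauss(47.5, 17) for _ in range(300)]

def mean(values):
    return sum(values) / len(values)

observed = mean(rep_ages) - mean(dem_ages)

# If party had nothing to do with age, the labels would be interchangeable.
# Shuffle the labels many times and count how often chance alone produces
# a gap at least as large as the one observed (in either direction).
pooled = dem_ages + rep_ages
n_dem = len(dem_ages)
n_shuffles = 2000
at_least_as_extreme = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    shuffled_diff = mean(pooled[n_dem:]) - mean(pooled[:n_dem])
    if abs(shuffled_diff) >= abs(observed):
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_shuffles
print(f"Observed difference: {observed:.2f} years")
print(f"Share of shuffles with a gap this large: {p_value:.3f}")
```

A small share means a gap this size rarely shows up by chance alone; a large share means it easily could. A real analysis would also need to account for how the survey was sampled and weighted.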

Oct 12, 2024

Data Literacy Basics – Part 2
Here are five more foundational concepts that EVERYONE should be aware of (again, in no particular order). 🧵

1. Not all differences are significant
When we look at data, it’s easy to think that any change or difference is important. In reality, data fluctuates all the time and it’s difficult to determine if a small change is natural variation or a true difference.

Take the weather as an example. Say the average temperature in a location last June was 75.1°F, but the previous year it was 73.6°F. Sure, the average temperature went up 1.5°F, but is this significant? How does this temp compare to all the previous years? Is this unusual or unexpected? These are good questions to keep in mind that will raise your data literacy.

A lot of statistics deals with hypothesis testing, which aims to answer the question “how likely is it that this data is just natural variation rather than a significant difference?”
(1/6)
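A rough way to ask "is this unusual?" is to compare the latest value with the spread of earlier years. In this Python sketch the eight earlier June averages are hypothetical (only 75.1°F and 73.6°F come from the example above), and the z-score used here is a quick rule of thumb, not a formal test.

```python
import statistics

# Hypothetical June average temperatures (°F) for eight earlier years.
previous_junes = [72.8, 74.1, 73.0, 75.3, 73.9, 72.5, 74.6, 73.6]
latest_june = 75.1

typical = statistics.mean(previous_junes)
spread = statistics.stdev(previous_junes)  # sample standard deviation

# How many standard deviations the latest value sits above the typical year.
z_score = (latest_june - typical) / spread
print(f"Typical June: {typical:.1f}°F, year-to-year spread: {spread:.2f}°F")
print(f"Latest June is {z_score:.1f} standard deviations above typical")
```

With these made-up numbers the latest June lands between one and two standard deviations above the average: warm, but well within what the earlier years already showed (one of them was even hotter).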

2. Regression to the mean
One tricky aspect of data is that, even when something seems to show a big difference, it might not actually be significant. Extreme outlier events can occur, but they often return to normal quickly, a phenomenon known as regression to the mean.

When people spot an outlier, especially if it's the most recent data point, their immediate reaction is often to assume it signals a change in trends.

You see this often in sports. An athlete might have an amazing game, scoring far beyond their typical performance. Commentators might rave about this "breakthrough." However, in the next few games, their performance often returns to what's more typical for them. This doesn't mean they've lost skill, or something has gone wrong; it's just regression to the mean. The exceptional game was an outlier, and over time, their performance will even out to what it usually is. Keep this in mind when looking at single data points.
(2/6)
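Regression to the mean is easy to see in a simulation. In this Python sketch the player's true average (20 points), the game-to-game noise (6 points), and the 30-point cutoff for a "standout" game are all assumptions. The player's skill never changes; only luck varies.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

TRUE_AVERAGE = 20.0  # hypothetical points per game; skill stays constant
GAME_NOISE = 6.0     # hypothetical game-to-game variation

def play_game():
    """One game: constant skill plus random luck."""
    return rng.gauss(TRUE_AVERAGE, GAME_NOISE)

standout_scores = []
next_game_scores = []
for _ in range(5000):
    game, next_game = play_game(), play_game()
    if game >= 30:  # a standout performance
        standout_scores.append(game)
        next_game_scores.append(next_game)

avg_standout = sum(standout_scores) / len(standout_scores)
avg_next = sum(next_game_scores) / len(next_game_scores)
print(f"Standout games found: {len(standout_scores)}")
print(f"Average standout game:       {avg_standout:.1f} points")
print(f"Average game right after it: {avg_next:.1f} points")
```

The game after a standout performance averages out near the player's usual 20 points, even though nothing about the player changed between games.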

3. Standard deviation
An important concept that everyone should know about, especially if you regularly work with data. Standard deviation helps us understand how much the numbers in a dataset vary from the average. Think of it as a measure of consistency.

When the standard deviation is low, it means that most of the data points are pretty close to the average, suggesting that things are fairly stable. On the other hand, a high standard deviation indicates that the data points are more spread out, showing a lot of variability. Understanding standard deviation can give you insight into the reliability of the data you’re looking at.

Imagine you're comparing the customer ratings of two similar products online. Both products have an average rating of 4 out of 5 stars. However, Product A has ratings that are mostly between 3.5 and 4.5 stars, while Product B has ratings that range widely between 1 and 5 stars. Even though both products have the same average rating, the standard deviation for Product A would be lower, indicating more consistent feedback. This might make you feel more confident in choosing Product A, as the ratings are more stable and reliable.
(3/6)
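The product example in a few lines of Python. The individual ratings are invented to match the description: both products average 4 stars, but Product B's ratings are far more spread out.

```python
import statistics

# Hypothetical star ratings for two products (made-up values).
product_a = [3.5, 4.0, 4.0, 4.0, 4.5]
product_b = [1.0, 5.0, 5.0, 4.0, 5.0]

for name, ratings in (("Product A", product_a), ("Product B", product_b)):
    average = statistics.mean(ratings)
    # pstdev treats the list as the full set of ratings, not a sample.
    spread = statistics.pstdev(ratings)
    print(f"{name}: average {average:.1f} stars, standard deviation {spread:.2f}")
```

Same average, very different consistency: Product A's standard deviation is about 0.3 stars while Product B's is about 1.5.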
