Data Literacy Basics - Part 1
Below are five foundational concepts that EVERYONE should understand (in no particular order). Also, let me know what you would add.
1. Outliers rarely disprove trends.
I see this a lot. People, when presented with a statistic, will often try to discredit it by bringing up edge cases or outliers. The reality is that data has natural variation, even within a distribution or trend.
We all know this. If I were to say “The average height for an American male is about 5 feet 9 inches,” but my friend chimed in with “That can’t be true! My uncle is 6 feet 8 inches,” you surely wouldn’t agree that a single data point disproves my statistic. That's an easy example because we are all familiar with people's heights, but for data we aren’t accustomed to, this becomes very important to keep in mind.
🧵 (1/6)
2. Correlation does not imply causation
I’m sure we’ve all heard this ~1000 times, but for good reason. When you see variables, data points, trends, distributions, etc. that are related or move together, this doesn’t necessarily mean one is causing a direct change in the other(s). In general, causal analysis is difficult. There might be other variables not accounted for (called confounding variables) explaining the correlation.
Textbook example: When ice cream sales increase, drowning incidents also tend to increase. However, this does not mean that eating ice cream causes drowning or vice-versa. The real reason for this correlation is that both ice cream sales and drownings increase during the summer, when warmer weather is the underlying cause of both.
Additionally, a correlation could be a coincidence made to look strong through visualization, like the correlation between the consumption of margarine and the divorce rate in Maine.
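The ice cream example can be simulated in a few lines. This is only a sketch with made-up numbers: the temperature range, the effect sizes, and the noise levels are all assumptions chosen for illustration, not real data. Temperature drives both series, and neither series affects the other.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Made-up model: daily temperature drives BOTH quantities.
temperature = [rng.uniform(0, 35) for _ in range(365)]
ice_cream_sales = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream_sales, drownings)

# Because this is a simulation, we know temperature's true effect
# and can subtract it from each series.
sales_left_over = [s - 10 * t for s, t in zip(ice_cream_sales, temperature)]
drownings_left_over = [d - 0.2 * t for d, t in zip(drownings, temperature)]
adjusted_r = pearson(sales_left_over, drownings_left_over)

print(f"raw correlation: {raw_r:.2f}")
print(f"after removing temperature: {adjusted_r:.2f}")
```

The raw correlation comes out strong even though the code never lets ice cream influence drownings. Once the temperature effect is removed, what is left is just unrelated noise.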
(2/6)
3. Per capita
Another one I see omitted frequently. Adjusting your numbers to be “per capita” means dividing a total by the number of individuals, so the metric is expressed per person. This often allows you to compare groups without worrying much about differences in the number of individuals in them.
For example, if we want to understand GDP differences between two countries, just looking at the totals for each may be more of a function of population size than anything else. Dividing each country's GDP by its population (i.e. GDP per capita) is usually a better comparison.
When in doubt, focus on per capita.
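As a quick sketch of the GDP comparison, here are two hypothetical countries with invented figures (the names and numbers are assumptions, not real statistics):

```python
# Made-up figures for two hypothetical countries.
countries = {
    "Country A": {"gdp": 2_000_000_000_000, "population": 200_000_000},
    "Country B": {"gdp": 500_000_000_000, "population": 10_000_000},
}

def gdp_per_capita(gdp, population):
    """Total GDP divided by the number of people."""
    return gdp / population

per_capita = {
    name: gdp_per_capita(figures["gdp"], figures["population"])
    for name, figures in countries.items()
}

for name, value in per_capita.items():
    print(f"{name}: total GDP {countries[name]['gdp']:,}, per capita {value:,.0f}")
```

Country A has four times the total GDP, but Country B comes out five times higher per person. Which number matters depends on the question, but the totals alone would hide this.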
(3/6)
4. Means vs Medians
Both are usually used for the same goal: understanding what a "typical" value in a dataset might look like. However, the calculations are very different even though I hear them used interchangeably.
The mean is simply the average value of the dataset. Sum everything up and divide by the number of data points (we’re just sticking with the arithmetic mean here). The big downside of the mean is that it’s heavily influenced by extreme outliers.
The median is simply the middle value of the dataset when ordered, so it is far less affected by outliers. If your data is relatively “normal” (balanced looking), either will work well. If your data is “skewed” (unbalanced looking), medians (or maybe even modes) might be a better representation of a typical value.
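Here is a small sketch of the difference using Python's standard `statistics` module. The salary figures are invented for illustration, with one extreme value on purpose:

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars; the last one is an extreme outlier.
salaries = [40, 45, 50, 55, 60, 1000]

typical_mean = mean(salaries)      # pulled upward by the single outlier
typical_median = median(salaries)  # midpoint of the two middle values when sorted

print(f"mean = {typical_mean:.1f}, median = {typical_median:.1f}")
```

Five of the six values sit between 40 and 60, yet the mean lands above 200. The median stays in the range where most of the data actually is.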
(4/6)
5. Sample size matters, but not as much as you might think
Interestingly, this last one usually trips up people with some data literacy more than those starting from zero. One of the go-to questions for a study is “what was the sample size?” and if you’re asking that, you likely shouldn’t be worried about it. The reality is that you can make very accurate inferences about a large group (called a population) from a relatively small sample. Sample sizes hit diminishing returns very quickly. There’s a lot of fun math as to how and why this is the case that we stats nerds use, but that’s beyond the scope of this.
What is far more important than sample size is good, representative sampling. I could write a whole thread on this (there are entire textbooks and courses on this topic), but just know that with proper sampling methods and study design, you can easily infer statistics about millions with a sample of a couple thousand.
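To give a feel for the diminishing returns, here is a sketch using the standard margin-of-error approximation for a proportion. It assumes simple random sampling from a very large population, a 95% confidence level (z = 1.96), and the worst case of a 50/50 split. Real studies with other designs will differ.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n from a very large population.
    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 2_000, 10_000, 100_000):
    print(f"n = {n:>7,}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Notice that the population size does not appear in the formula at all. Because the error shrinks with the square root of n, you need four times the sample to cut the margin in half.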
(5/6)
These were just 5 basic ideas off the top of my head. There are more to cover in future posts. Let me know what you would add or expand on. In the future I might dive into more intermediate topics (hypothesis testing, regression analysis, model validation, etc.) occasionally if there’s interest.
(6/6)
Data Literacy Basics – Part 2
Here are five more foundational concepts that EVERYONE should be aware of (again, in no particular order). 🧵
1. Not all differences are significant
When we look at data, it’s easy to think that any change or difference is important. In reality, data fluctuates all the time and it’s difficult to determine if a small change is natural variation or a true difference.
Take the weather as an example. Say the average temperature in a location last June was 75.1°F, but the previous year it was 73.6°F. Sure, the average temperature went up 1.5°F, but is this significant? How does this temp compare to all the previous years? Is this unusual or unexpected? These are good questions to keep in mind that will raise your data literacy.
A lot of statistics deals with hypothesis testing, which aims to answer the question “how likely is it that this data is just natural variation rather than a significant difference?”
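A rough way to ask "is this unusual?" is to compare the new value against the spread of past years. The sketch below uses ten invented June averages (only 73.6°F and 75.1°F come from the example above; the rest are assumptions) and computes a simple z-score. This is a back-of-the-envelope check, not a formal hypothesis test.

```python
from statistics import mean, stdev

# Hypothetical June average temperatures (°F) for the previous ten years.
past_junes = [72.8, 74.9, 73.1, 75.4, 73.6, 74.2, 72.5, 75.0, 73.9, 74.6]
latest_june = 75.1

historical_mean = mean(past_junes)
historical_sd = stdev(past_junes)  # sample standard deviation

# How many standard deviations the latest value sits above the historical mean.
z_score = (latest_june - historical_mean) / historical_sd

print(f"historical mean = {historical_mean:.1f}, sd = {historical_sd:.2f}")
print(f"z-score of latest June = {z_score:.2f}")
```

With these made-up numbers, the latest June sits only about one standard deviation above the historical average, which is well within the range of ordinary year-to-year variation.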
(1/6)
2. Regression to the mean
One tricky aspect of data is that, even when something seems to show a big difference, it might not actually be significant. Extreme values can occur, but the values that follow are usually closer to normal, a phenomenon known as regression to the mean.
When people spot an outlier—especially if it's the most recent data point—their immediate reaction is often to assume it signals a change in trends.
You see this often in sports. An athlete might have an amazing game, scoring far beyond their typical performance. Commentators might rave about this "breakthrough." However, in the next few games, their performance often returns to what's more typical for them. This doesn't mean they've lost skill, or something has gone wrong; it's just regression to the mean. The exceptional game was an outlier, and over time, their performance will even out to what it usually is. Keep this in mind when looking at single data points.
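The sports example can be simulated too. In this sketch every player has exactly the same true skill (an assumption made so that luck is the only difference), and the scoring average and game-to-game noise are invented numbers.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the run is repeatable

TRUE_AVERAGE = 20  # hypothetical: every player truly averages 20 points
GAME_NOISE = 5     # hypothetical game-to-game standard deviation

def play_game():
    """One game's score: true average plus random luck."""
    return rng.gauss(TRUE_AVERAGE, GAME_NOISE)

# Each player plays two games; skill is identical, only luck differs.
players = [(play_game(), play_game()) for _ in range(1000)]

# Pick the 100 players with the best first game.
standouts = sorted(players, key=lambda games: games[0], reverse=True)[:100]

first_game_avg = mean(game1 for game1, _ in standouts)
second_game_avg = mean(game2 for _, game2 in standouts)

print(f"standouts, game 1 average: {first_game_avg:.1f}")
print(f"standouts, game 2 average: {second_game_avg:.1f}")
```

The "breakthrough" group looks great in game 1 because it was selected for being lucky. In game 2 the same players score close to their true average, with nothing having changed about them.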
(2/6)
3. Standard deviation
An important concept that everyone should know about, especially if you regularly work with data. Standard deviation helps us understand how much the numbers in a dataset vary from the average. Think of it as a measure of consistency.
When the standard deviation is low, it means that most of the data points are pretty close to the average, suggesting that things are fairly stable. On the other hand, a high standard deviation indicates that the data points are more spread out, showing a lot of variability. Understanding standard deviation can give you insight into the reliability of the data you’re looking at.
Imagine you're comparing the customer ratings of two similar products online. Both products have an average rating of 4 out of 5 stars. However, Product A has ratings that are mostly between 3.5 and 4.5 stars, while Product B has ratings that range widely between 1 and 5 stars. Even though both products have the same average rating, the standard deviation for Product A would be lower, indicating more consistent feedback. This might make you feel more confident in choosing Product A, as the ratings are more stable and reliable.
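Here is the two-product comparison as a small sketch. The individual ratings are invented to match the description (same average, different spread), and `pstdev` treats each list as the full set of ratings rather than a sample.

```python
from statistics import mean, pstdev

# Hypothetical ratings; both lists average 4 stars.
product_a = [3.5, 4, 4, 4, 4.5]  # tightly clustered
product_b = [1, 5, 5, 5, 4]      # widely spread

avg_a, avg_b = mean(product_a), mean(product_b)
sd_a, sd_b = pstdev(product_a), pstdev(product_b)

print(f"Product A: average {avg_a:.1f}, standard deviation {sd_a:.2f}")
print(f"Product B: average {avg_b:.1f}, standard deviation {sd_b:.2f}")
```

The averages are identical, so the average alone cannot tell the products apart. The standard deviation is what shows that Product B's ratings are much less consistent.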
(3/6)