Bob is a misogynist who subconsciously avoids intelligent women. Bob randomly samples a list of men and women from the people he knows. He uses statistical best practices to analyze their cognitive abilities and concludes women are less intelligent than men.
Is Bob's analysis scientific?
Reasons to say Bob's analysis is scientific:
1. Anybody else analyzing the same data will get a similar answer.
2. The model is predictive of future data. Bob will continue to avoid intelligent women in the future and so the model will accurately predict his future experiences.
A second misogynist, Tom, replicates Bob's study using the people he knows. Tom confirms Bob's findings.
Additionally, a group of 1,000 scientifically inclined misogynists pool all of the people they know, and a third-party data scientist finds a similar result.
At this point, the finding is seemingly robust and highly replicable.
Should we now be more convinced than ever that Bob's study is objective and scientific?
If you object that this isn't science, what standard scientific norm could you apply to disqualify the finding?
Note that:
- the data (as collected) are accurate
- the models are predictive of future data
- the finding has been replicated multiple times by multiple groups
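To make the point concrete, here's a minimal Python sketch (all numbers made up by me) in which men and women have identical intelligence distributions, yet Bob's biased sampling reliably produces a gap:

```python
import random

random.seed(42)

def population(n):
    # Same distribution for both groups: scores ~ Normal(100, 15).
    return [random.gauss(100, 15) for _ in range(n)]

men = population(10_000)
women = population(10_000)

# Unbiased sampling for men; biased sampling for women:
# Bob "subconsciously avoids" women scoring above 100, keeping only 10% of them.
known_men = random.sample(men, 200)
filtered = [w for w in random.sample(women, 2_000) if w < 100 or random.random() < 0.1]
known_women = filtered[:200]

mean = lambda xs: sum(xs) / len(xs)
print(f"population means:   men {mean(men):.1f}, women {mean(women):.1f}")  # ~100 vs ~100
print(f"Bob's sample means: men {mean(known_men):.1f}, women {mean(known_women):.1f}")

# Anyone reanalyzing (known_men, known_women) replicates the gap, and any
# misogynist who samples the same way reproduces it. The flaw is upstream,
# in how the data were generated, not in the analysis.
```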
Read on for details as I zoom in on each half of this graphic.
BAYESIAN VERSION
The most common way of interpreting this equation is Bayesian.
The PRIOR (our level of certainty before seeing the data) is updated using the following equation to obtain the POSTERIOR (our level of certainty after seeing the data).
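The equation in question (shown only in the graphic, so I'll write it out here) is Bayes' theorem:

$$P(H \mid D) \;=\; \frac{P(D \mid H)\,P(H)}{P(D)}$$

where $H$ is the hypothesis and $D$ is the data: $P(H)$ is the prior, $P(D \mid H)$ is the likelihood of the data under the hypothesis, $P(D)$ is the evidence, and $P(H \mid D)$ is the posterior.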
Should we teach Calculus or Data Science in high school? Why not both?
Here's how I'd explain the calculus concept of a limit from a data science perspective:
Imagine you have a machine. Let's call it "f". If we feed a number x into the machine as input, then we get out a new number as output. Let's call the new number f(x).
There's just one issue.
In the real world, your inputs are noisy. So, your output ends up being noisy too.
This is where the idea of a limit comes in. The limit is a guarantee on the quality of your outputs: for any error tolerance you demand on the output, there is a noise tolerance on the input that guarantees you meet it.
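Here's a minimal Python sketch of that guarantee (my example, using f(x) = x² near x = 2, where the limit is 4): demand output error below ε, and a small enough input tolerance δ delivers it.

```python
import random

def f(x):
    return x ** 2

target_x, limit_value = 2.0, 4.0
epsilon = 0.01       # output quality we demand: |f(x) - 4| < 0.01
delta = epsilon / 5  # input tolerance; 5 bounds |x + 2| near x = 2, so this suffices

for _ in range(10_000):
    noisy_x = target_x + random.uniform(-delta, delta)  # noisy input within delta
    assert abs(f(noisy_x) - limit_value) < epsilon      # output stays within epsilon

print(f"Inputs within {delta} of 2 kept every output within {epsilon} of 4.")
```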
Here are three different ways of thinking about linear regression and why it works.
PERSPECTIVE 1: Physics
Hooke's law is a simple mathematical model of a metal spring. It states that the force on a spring is proportional to the length that the spring is stretched.
If you take a bunch of springs that follow Hooke's law, attach one end of each spring to a data point and the other end to a straight line (keeping every spring vertical), the equilibrium position of that physical system is the linear regression line. That's because a Hooke's-law spring stores potential energy proportional to its squared stretch, so the equilibrium minimizes the sum of squared vertical distances: exactly the least-squares objective.
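Here's a minimal numpy sketch of that claim (my own illustration, with synthetic data): the OLS line is the energy minimum, so nudging it in any direction raises the total spring energy.

```python
import numpy as np

# Assumed setup: each spring stretches vertically, with energy (1/2) k * stretch^2.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.5 * x + 1.0 + rng.normal(0, 2, size=x.shape)  # noisy linear data

def spring_energy(slope, intercept):
    # Total potential energy of all springs, with spring constant k = 1.
    stretch = y - (slope * x + intercept)
    return 0.5 * np.sum(stretch ** 2)

# Ordinary least squares fit (the regression line).
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)

# Perturbing the line in any direction increases the energy,
# so the OLS line is the equilibrium of the spring system.
e0 = spring_energy(slope_ols, intercept_ols)
for ds, di in [(0.1, 0), (-0.1, 0), (0, 0.5), (0, -0.5)]:
    assert spring_energy(slope_ols + ds, intercept_ols + di) > e0

print(f"Equilibrium (OLS) line: y = {slope_ols:.2f} x + {intercept_ols:.2f}")
```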
THREE simple frameworks for thinking about measures of central tendency.
This thread has it all!
Warning: You may have heard people say there's only one thing called "the average" or "the mean". In this thread, we'll use "average" and "mean" to refer to any member of a large family of measures of central tendency.
1. Mode
(Let's start slow. Feel free to skip the stuff you already know!)
This is the value that occurs most frequently in your data.
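For instance, here's a quick way to compute it (a small Python sketch, with made-up data):

```python
from collections import Counter

# The mode is the value that occurs most frequently in the data.
data = [3, 7, 3, 9, 3, 7, 1]

value, count = Counter(data).most_common(1)[0]
print(f"mode = {value} (appears {count} times)")  # mode = 3 (appears 3 times)
```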
One of the most valuable classes I took at Harvard was a short course on speed reading. Here's what I learned:
1. Minimize Fixations
Fixations are all the positions where your eyes stop as you're scanning a line of text.
Minimize these by reading words in chunks. Don't focus on just one word at a time. Broaden your focus so you're always taking in multiple words at once.
2. Avoid Regressions
"Regression" is a technical term for going back and reading stuff you just read. It's normal to feel like you need to do this but you don't. It's hard but you have to force yourself to keep pushing forward, and eventually the urge to regress will go away.