The reason machine learning algorithms show bias is that the goal of these algorithms is to learn ALL the patterns in the data, including the biases. The "bias" is actually the gap between what the data scientist THINKS is being learned and what's actually being learned. 🧵
An interesting feature of this bias is that it's subjective. It depends on what the data scientist INTENDED to learn from the data. For all we know, the data scientist intended to learn all the patterns in the data, racism and all. In which case, there is no bias.
Generally, machine learning does not require us to be specific about what patterns we are trying to learn. It just vaguely picks up all of them. This means we often have no clue what was learned or whether it's what we intended to learn.
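To make that concrete, here's a minimal sketch in Python (synthetic data, hypothetical feature names). The data scientist INTENDS the model to learn skill → hired, but because the historical labels were biased, the model learns the group pattern too:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 10_000
skill = rng.normal(size=n)
group = rng.integers(0, 2, size=n)  # e.g., a demographic attribute

# Biased historical decisions: same skill, different bar per group.
hired = (skill + 1.0 * group + rng.normal(0, 0.5, size=n) > 0.5).astype(int)

X = np.column_stack([skill, group])
model = LogisticRegression().fit(X, hired)

# The model happily learns the bias along with the "real" pattern.
print("coef on skill:", model.coef_[0][0])
print("coef on group:", model.coef_[0][1])  # nonzero: the bias was learned
```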
Traditional statistics isn't like this. In statistics, the first step is specifying what patterns you want to detect. This requires you to have some kind of theory about the structure of the data. Most importantly, this allows you to check if your theory is wrong.
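For contrast, a minimal sketch of that statistical workflow (synthetic data): the theory is specified up front, so there's a concrete way to check whether it's wrong.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, size=x.size)

# Step 1: state the theory -- y is linear in x -- and fit exactly that model.
fit = stats.linregress(x, y)
residuals = y - (fit.slope * x + fit.intercept)

# Step 2: check whether the theory is wrong. If it's right, the residuals
# should show no leftover structure; here we probe for curvature the
# linear model would have missed.
r, p = stats.pearsonr((x - x.mean()) ** 2, residuals)
print(f"slope={fit.slope:.2f}, curvature check r={r:.3f} (p={p:.3f})")
```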
This issue is a huge weakness of the machine learning approach. The vagueness about what is being learned means we have to do a lot of work after fitting the model to understand the properties of the model itself. In practice, this work is often not done.
The reason we need to do the work is that we can't rely on theory to tell us what the model learned, so we must measure it. This means looking at how the model behaves to see if it's racist, sexist, or has other biases we might care about.
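A minimal sketch of what "measure it" can look like (hypothetical arrays; comparing selection rates across groups is just one of many possible checks):

```python
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # the model's decisions
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # protected attribute

for g in np.unique(group):
    rate = y_pred[group == g].mean()
    print(f"group {g}: selection rate = {rate:.2f}")

# A large gap in selection rates (demographic parity difference) is one
# red flag worth investigating; no single metric settles the question.
```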
As the many examples of racist algorithms show, many people using machine learning mistakenly think they can rely on intuition to guess what kinds of patterns are in their dataset and what kinds of patterns their algorithms are learning. This is naive.
I think the solution to racism in algorithms (and other biases of this kind) is to be more hands-on about understanding the processes that created the data your model uses and more proactive and explicit about checking that your models have the properties you think they have. 🧵
If you think about how statistics works, it's extremely obvious why a model built on purely statistical patterns would "hallucinate". Explanation in next tweet.
Very simply, statistics is about taking two points you know exist and drawing a line between them, basically completing patterns.
Sometimes that middle point is something that exists in the physical world; sometimes it's something that could potentially exist but doesn't.
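The same idea in code (a toy example with np.interp): the interpolated point is pattern-consistent, but it was never actually observed.

```python
import numpy as np

# Two known points, and a "hallucinated" point in between that the
# data never contained.
x_known = np.array([0.0, 10.0])
y_known = np.array([0.0, 100.0])

y_new = np.interp(5.0, x_known, y_known)  # linear interpolation
print(y_new)  # 50.0 -- plausible, pattern-consistent, never observed
```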
Imagine an algorithm that could predict what a couple’s kids might look like. How’s the algorithm supposed to know if one of those kids it predicted actually exists or not?
The child’s existence has no logical relationship to the genomics data the algorithm has available.
These grants aren't charity. They're highly competitive contracts where the US government determines Harvard is the best institution for conducting specific research, and then pays Harvard for services rendered to US taxpayers.
Each grant represents a fair contract that a group at Harvard won after competing with hundreds or even thousands of other groups. These are not handouts.
The US government pays Harvard and other universities to provide answers to questions that aren't directly profitable in themselves, but which provide a foundation for private sector innovation, and help maintain American dominance over geopolitical rivals like China.
As someone who translates ideas into math for a living, I noticed something weird about the tariff formula that I haven't seen anybody else talk about. 🧵
The formula defines the tariff rate as exactly the percent you need to charge on imports to make up for the trade deficit. Basically,
trade deficit = tariff rate x imports
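Plugging in hypothetical round numbers (not real trade figures) to show the arithmetic:

```python
trade_deficit = 300.0   # hypothetical: $300B deficit with some country
imports = 450.0         # hypothetical: $450B of imports from that country

# Solving "trade deficit = tariff rate x imports" for the tariff rate:
tariff_rate = trade_deficit / imports
print(f"{tariff_rate:.0%}")  # 67%
```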
It's constructed as if tariffs are a kind of compensation for trade deficits, but this raises a question.
If tariffs are something foreign countries owe the American people for the trade deficit, then forcing US businesses to make up the difference by paying extra money to the US government is kind of a weird solution.
Whenever I see students with good grades but lots of college rejections, my first thought is a bad personal essay. As predicted, this guy's essay was kind of a disaster.
Since I did get into Harvard, I'll give my two cents on the essay:
In honor of international women's day, let's take a moment to remember the most famous statistician in history.
You've definitely heard of her, but you probably have no idea she was a statistician.
It's Florence Nightingale.
Nightingale was the first female member of the Royal Statistical Society and a pioneer in using statistical analysis to guide medical decisions and public health policy.
Florence Nightingale's most famous statistical analysis was her investigation into the mortality rates of soldiers during the Crimean War. She demonstrated that the majority of deaths among soldiers were due to preventable diseases rather than battlefield injuries!
Took one for the team and made a histogram of the Elon social security data. Not sure why his data scientists are just giving him raw tables like that.
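For reference, that first step is a few lines of Python, assuming the raw table is exported as a CSV (hypothetical file and column names):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot the distribution before tweeting conclusions about it.
df = pd.read_csv("social_security_records.csv")  # assumed export of the raw table
df["age"].plot.hist(bins=30)
plt.xlabel("Recorded age")
plt.ylabel("Number of records")
plt.title("Ages in the raw table")
plt.show()
```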
It's also weird that they keep tweeting out these extremely strong claims without taking a few days to do some basic follow-up work.
It doesn’t come off like they even:
- plotted the data
- talked to any of the data collectors
- considered any alternative explanations