The reason machine learning algorithms show bias is that the goal of these algorithms is to learn ALL the patterns in the data including the biases. The "bias" is actually the gap between what the data scientist THINKS is being learned and what's actually being learned. 🧵
An interesting feature of this bias is that it's subjective. It depends on what the data scientist INTENDED to learn from the data. For all we know, the data scientist intended to learn all the patterns in the data, racism and all. In that case, by this definition, there is no bias.
Generally, machine learning does not require us to be specific about what patterns we are trying to learn. It just vaguely picks up all of them. This means we often have no clue what was learned or whether it is what we intended to learn.
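Here is a toy sketch of "picks up all of them". Everything in it is made up for illustration: the hiring records, the zip codes, and the frequency-table "model" are hypothetical and not taken from any real system. The model is never shown the group column, yet it learns the biased hiring rule through the zip code.

```python
from collections import defaultdict

def make_history():
    """Hypothetical past hiring records: (group, zip_code, skill, hired)."""
    rows = []
    for group in ("A", "B"):
        for i in range(100):
            skill = i % 10  # skill 0..9, identical distribution in both groups
            # zip code is a proxy: 90% of A lives in zip 1, 90% of B in zip 2
            zip_code = 1 if (group == "A") == (i < 90) else 2
            # biased past decisions: group B needed a higher skill to be hired
            hired = skill >= (5 if group == "A" else 8)
            rows.append((group, zip_code, skill, hired))
    return rows

def fit(rows):
    """'Model': the observed hire rate for each (zip_code, skill) cell."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [hired count, total count]
    for _group, zip_code, skill, hired in rows:
        cells[(zip_code, skill)][0] += hired
        cells[(zip_code, skill)][1] += 1
    return {cell: h / n for cell, (h, n) in cells.items()}

def predicted_hire_rate(model, rows):
    """Share of each group the model would hire (predict hire if rate >= 0.5)."""
    hires, totals = defaultdict(int), defaultdict(int)
    for group, zip_code, skill, _hired in rows:
        hires[group] += model[(zip_code, skill)] >= 0.5
        totals[group] += 1
    return {g: hires[g] / totals[g] for g in totals}

rows = make_history()
model = fit(rows)  # trained on zip code and skill only, never on group
rates = predicted_hire_rate(model, rows)
print(rates)
```

Both groups have exactly the same skills, but the model hires group A about twice as often as group B. Nobody asked it to learn that. It was in the data, so it got learned.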
Traditional statistics isn't like this. In statistics, the first step is specifying what patterns you want to detect. This requires you to have some kind of theory about the structure of the data. Most importantly, this allows you to check if your theory is wrong.
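A minimal sketch of "check if your theory is wrong", assuming the theory is that y is a straight-line function of x. The data is made up and secretly curved, and `fit_line` is just a hand-written least squares helper for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

xs = list(range(11))
ys = [x ** 2 for x in xs]  # made-up data: the truth is a curve, not a line

intercept, slope = fit_line(xs, ys)

# If the straight-line theory were right, residuals would scatter around zero
# with no pattern. Here we just record the sign of each residual in x order.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
```

The residuals are positive at both ends and negative in the middle. That systematic pattern is the data telling you the straight-line theory is wrong, which is only possible because the theory was written down first.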
This issue is a huge weakness of the machine learning approach. The vagueness about what is being learned means that we have to do a lot of work after we fit the model to understand the properties of the model itself. In practice, this work is often not done.
We need to do this work because we can't rely on theory to tell us what the model learned, so we must measure it. This means looking at how the model behaves in order to see if it's racist, sexist or has other biases we might care about.
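One way to "measure it", as a rough sketch: compare how the model treats each group. The records below are invented, and `audit_by_group` is a hypothetical helper, not part of any library. Which rates matter depends on what you intended the model to do.

```python
from collections import defaultdict

def audit_by_group(records):
    """records: iterable of (group, actual, predicted) with boolean outcomes."""
    counts = defaultdict(lambda: {"n": 0, "selected": 0,
                                  "pos": 0, "fn": 0, "neg": 0, "fp": 0})
    for group, actual, predicted in records:
        c = counts[group]
        c["n"] += 1
        c["selected"] += bool(predicted)
        if actual:
            c["pos"] += 1
            c["fn"] += not predicted  # qualified but rejected
        else:
            c["neg"] += 1
            c["fp"] += bool(predicted)  # unqualified but accepted
    report = {}
    for group, c in counts.items():
        report[group] = {
            "selection_rate": c["selected"] / c["n"],
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
    return report

# Invented example: (group, truly qualified?, model said yes?)
records = [
    ("A", True, True), ("A", True, True), ("A", False, False), ("A", False, True),
    ("B", True, False), ("B", True, True), ("B", False, False), ("B", False, False),
]
report = audit_by_group(records)
for group, stats in report.items():
    print(group, stats)
```

In this made-up example the model says yes to group A three times as often as group B, and all of its mistakes favor A. None of that is visible from overall accuracy alone, which is why the per-group breakdown has to be computed explicitly.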
As the many examples of racist algorithms show, a lot of people using machine learning mistakenly think they can rely on their intuitions to guess what kinds of patterns are in their dataset and what kinds of patterns their algorithms are learning. This is naive.
I think the solution to racism in algorithms (and other biases of this kind) is to be more hands-on about understanding the processes that created the data your model uses and more proactive and explicit about checking that your models have the properties you think they have. 🧵

• • •


Keep Current with 🔥Kareem Carr | Statistician 🔥


