This thread walks you through a concrete example of how an algorithm can learn racism. It uses some math, but only the bare minimum, and has lots of pictures. It is *very* accessible. If that sounds like your thing, read on. 🧵👇
Let's start by learning about statistical bias. Statistical bias is a measure of how good a guessing algorithm is at guessing. It's very straightforward. The bias is the average difference between what the algorithm guesses a value is and what that value actually is. 2/11
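In symbols, using E[x] for the average of x (explained in the note at the end of this thread), the bias of a guess θ̂ at a true value θ is:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta} - \theta] = E[\hat{\theta}] - \theta$$

A bias of zero means the guesses are right on average.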
The example I'm going to talk about is an algorithm that learns how to measure feelings based on text. We call this measurement a "sentiment score". 3/11
A sentiment score of zero is neutral. A positive score means positive feelings and a negative score means negative feelings. The more positive the sentiment score, the more positive the feelings. The more negative the sentiment score, the more negative the feelings. 4/11
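If it helps to see the sign convention in code, here's a minimal toy scorer. To be clear, this is not the learning algorithm discussed in this thread; the word list and weights are made up purely for illustration.

```python
# Toy lexicon-based sentiment scorer (illustrative only; made-up lexicon).
LEXICON = {"love": +2.0, "great": +1.0, "okay": 0.0, "bad": -1.0, "hate": -2.0}

def sentiment_score(text: str) -> float:
    """Average the scores of known words; 0.0 means neutral."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment_score("I love this"))   #  2.0 -> positive feelings
print(sentiment_score("this is bad"))   # -1.0 -> negative feelings
```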
In this example, an amazing thing happens. Our algorithm learns racism! It learns that in general people have negative feelings about certain minorities. Many people will claim the algorithm is malfunctioning but it's not. It seems to be learning people's actual feelings. 5/11
If we think about the definition of statistical bias at the beginning of the thread, it gives us a hint about what our mistake is. 6/11
What's happening is that when we learned feelings from the data, we implicitly defined the "true value" as people's actual feelings. If we don't want the algorithm to pick up racism, then we should have defined it as people's actual feelings excluding racism. 7/11
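One way to write the mistake down (the symbols s and r here are mine, introduced for illustration): suppose each observed label y is someone's actual feelings excluding racism, s, plus a racism component, r. An algorithm trained against y can be unbiased for y while being biased for the target we intended:

$$y = s + r \quad\Rightarrow\quad E[\hat{y} - s] = E[\hat{y} - y] + E[r] = E[r]$$

So relative to the intended "true value" s, the bias is exactly the average racism in the data, E[r].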
This issue is called confounding. Confounding happens whenever we compare two things and neglect a third variable that could be driving the difference. 8/11
For instance, we might find that Apple users are happier than Windows users and conclude that this is because of their computer choices. But Apple products cost more, so the real reason Apple users are happier might be that they have more money. 9/11
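Here's a minimal simulation of that story, assuming (purely for illustration) that income drives both the computer choice and the happiness, and the computer itself does nothing:

```python
import random

random.seed(0)

apple, windows = [], []
for _ in range(100_000):
    income = random.gauss(50, 15)                    # in thousands of dollars
    buys_apple = income + random.gauss(0, 10) > 55   # richer people buy Apple more often
    happiness = 0.1 * income + random.gauss(0, 1)    # happiness depends on income ONLY
    (apple if buys_apple else windows).append(happiness)

print(sum(apple) / len(apple))      # higher on average...
print(sum(windows) / len(windows))  # ...even though the computer did nothing
```

The naive comparison "Apple users are happier" is true in this data, but the causal story "because of their computers" is false by construction.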
When people collect data and blindly learn whatever relationships are in the data, they can never be sure that what they're learning is what they intend to learn. They're implicitly making potentially false assumptions about the causal relationships in the data. 10/11
This is why it's extremely important to understand the relationships in the data and why "learning from the data" or being "data-driven" isn't enough when your data doesn't come from real experiments that were designed to generate the right kind of information. 11/11
This kind of long-form content takes extra work so if you like it and want to show support, like and retweet the thread, and give me a follow! 🙃
⚠️ I wanted to clarify that E[x] is statistics notation for the average of x. The E stands for the “expected value”. So E[x] is the “expected value of x”. We say “expected” because the average is basically what you should expect if you try something lots of times.
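A quick sketch of that "try something lots of times" idea (my example, not from the thread): a fair six-sided die has expected value E[x] = 3.5, and the average of many rolls gets close to it.

```python
import random

rolls = [random.randint(1, 6) for _ in range(1_000_000)]
print(sum(rolls) / len(rolls))  # close to 3.5, the expected value of a fair die
```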
I recently saw some reporting by a white conservative journalist alleging that black Americans don't value education, and I wanted to share my personal experience with that as a non-American black man. In 2018, I graduated with a master's in Biostatistics from Harvard. 🧵👇🏾
Traffic was bad on graduation day so I ended up walking through the streets in my full cap and gown. The thing I remember most about that day is all the happy black faces that were congratulating me. Boston is mostly white so it really stood out. 2/6
Black people were literally congratulating me in the streets as I walked past. Black people that I didn't know were honking their car horns. It clearly meant a lot to all of these people, who were probably on their way to work or running errands. 3/6
As a black man, I'm concerned about the tendency for algorithms to exhibit what looks like racial bias. As a statistician, I'm naturally drawn to investigate why this happens. But what is "bias"? Surprisingly, the answer depends on what you think it means to be "rational". 1/7
We can think of bias as a type of irrational behavior. So broadly speaking, there are two ways one could define bias in algorithms, and they arise from the two major definitions of rationality: epistemic rationality and instrumental rationality. 2/7
Epistemic rationality is defined as the part of rationality which involves achieving accurate beliefs about the world. Instrumental rationality is the art of choosing and implementing actions that steer the future toward outcomes that you want. 3/7
Want to know what kinds of bias are fixable with statistics and how?
Read on... 🧵👇
This is a simple mental map of how different biases affect the process of using algorithms to make changes to the physical world. The way we can fix each bias is as follows...
- Data selection bias: you need an accurate mathematical model of the data creation process (a minimal sketch follows this list)
- Statistical bias: you need good statistics
- Bias due to generalization: you need an accurate mathematical model of the observations in the data and in the target population 2/7
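As promised above, here's a minimal sketch of the first fix (the numbers and the selection model are assumptions made up for illustration). Inverse-probability weighting is one standard way to use a model of the data creation process to undo selection bias.

```python
import random

random.seed(1)

# Population: the values 0..9 are equally common, so the true mean is 4.5.
population = [random.randint(0, 9) for _ in range(100_000)]

def p_selected(v):
    """Assumed-known model of the data creation process:
    higher values are more likely to end up in the dataset."""
    return 0.1 + 0.08 * v

# The data we actually observe is a biased sample of the population.
sample = [v for v in population if random.random() < p_selected(v)]

naive_mean = sum(sample) / len(sample)
# Weight each observation by 1 / P(selected) to undo the selection.
weighted_mean = (sum(v / p_selected(v) for v in sample)
                 / sum(1 / p_selected(v) for v in sample))

print(naive_mean)     # noticeably above 4.5: selection bias
print(weighted_mean)  # close to the true mean of 4.5
```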
To fix the "bias due to causal assumptions", we need to fix all 3 smaller biases. At that point, if your model fits the data well then it should be a very close match to the world. In this case, correlation IS causation and we can say the inputs CAUSE the outputs. 3/7
Jon (@jonst0kes) wrote a thoughtful article about this weekend's events. I don't think he's a fan of "woke" politics but he's pretty good about not making his views the main focus of the piece. "On Saturday, March 27, Kareem Carr stepped on a...landmine" doxa.substack.com/p/understandin…
I don't know what I think of Jon's sociological analysis, but I also don't have a better explanation for why people who I've been friendly with and supportive of for most of my time on Twitter suddenly turned on me. I don't think it's because I was "wrong", because I wasn't.
Jon argues that I was attacked because I'm proposing a solutions-oriented approach. I can definitely find tweets where my critics were saying one of the "dangerous" myths I was promoting was that there were fixes for bias in algorithms.
FOUR things to know about race and gender bias in algorithms:
1. The bias starts in the data
2. The algorithms don't create the bias but they do transmit it
3. There are a huge number of other biases. Race and gender bias are just the most obvious
4. It's fixable! 🧵👇
By race and gender bias in algorithms, I mean the tendency for heavily data-driven AI algorithms to do things like reproduce negative stereotypes about women and people of color and to center white male subjects as normal or baseline. 2/9
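Here's a tiny made-up demonstration of points 1 and 2: the bias lives entirely in the labels, and even a completely "neutral" learning rule passes it along untouched.

```python
# Made-up training data: text mentioning two groups, with labels from
# annotators whose prejudice leaks in (point 1: the bias starts in the data).
data = [
    ("group_a", +1.0), ("group_a", +0.5), ("group_a", +1.5),
    ("group_b", -1.0), ("group_b", -0.5), ("group_b", -1.5),
]

def fit(data):
    """A "neutral" learning rule: predict the average label per group.
    It has no opinions of its own, yet it transmits the bias (point 2)."""
    sums, counts = {}, {}
    for group, label in data:
        sums[group] = sums.get(group, 0.0) + label
        counts[group] = counts.get(group, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

model = fit(data)
print(model["group_a"])  # +1.0: learned positive sentiment for group_a
print(model["group_b"])  # -1.0: learned negative sentiment, straight from the data
```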
While race and gender bias in algorithms *is fixable*, the current fixes aren't easy. They require us to understand and then mathematically model the processes that generate the biases in the data in the first place. 3/9
Many of the biggest tech trends in data analysis can be seen as increasingly sophisticated answers to the question, "How do we monetize data?" 🧵👇
The first answer to this question was the buzzword BIG DATA. People thought all you needed was a lot of data, didn't matter what kind, and it would basically monetize itself. Unfortunately, this was incorrect. So the next question became "How do we monetize lots of data?" 2/9
The answer to this question turned out to be the next buzzword. DATA SCIENCE. At this point, people still thought data was inherently easy to monetize so they figured anybody could do it. This turned out to be wrong as well. So the new question became... 3/9