Paul Hünermund is on Bluesky
This account is permanently inactive. Find me at: https://t.co/6pqaIc6kxY

Jun 24, 2022, 13 tweets

This is my favorite teaching example for showing the importance of #CausalInference: @Google conducts an annual pay equity analysis in which they use fairly advanced statistical techniques. In 2019 they found that they were actually underpaying MEN?! npr.org/2019/03/05/700… 1/

What do they do specifically? They collect a lot of data (as Google does) and then run OLS regressions of annual compensation on demographic variables (gender, race) and other explanatory variables such as tenure, location, and performance. services.google.com/fh/files/blogs… 2/

If they find statistically meaningful differences, @Google is actually committed to making upward adjustments for the disadvantaged groups. In this case it was male, level-4 software engineers who got a raise. 3/

But here comes the problem: Google runs these regressions separately for specific groups of employees, based on their job level and function. They do this to avoid comparing 🍎 with 🍐. And why wouldn't you? 4/

Well, we know that adjusting for a third variable can sometimes do funny things to the sign of a statistical relationship. This is the famous Simpson's paradox, named after the British statistician Edward Simpson (another white dude). everydayconcepts.io/simpsons-parad… 5/

It could very well be that women are overall paid less at an organization like Google, but if you adjust for a third variable like job level or function, the sign flips and suddenly you get the exact opposite direction for the relationship. 6/
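To make the sign flip concrete, here is a minimal sketch with made-up numbers. The headcounts, pay figures, and level names below are hypothetical and are not Google data. Women are concentrated in the lower job level, and within each level they earn slightly more than men:

```python
# Hypothetical headcounts and mean pay (in $1,000s); NOT Google data.
# (job level, gender) -> (headcount, mean pay)
cells = {
    ("junior", "women"): (80, 101.0),
    ("junior", "men"): (20, 100.0),
    ("senior", "women"): (20, 201.0),
    ("senior", "men"): (80, 200.0),
}


def mean_pay(gender, level=None):
    """Headcount-weighted mean pay for a gender, optionally within one level."""
    rows = [
        (n, pay)
        for (lvl, g), (n, pay) in cells.items()
        if g == gender and (level is None or lvl == level)
    ]
    total = sum(n for n, _ in rows)
    return sum(n * pay for n, pay in rows) / total


# Raw gap (women minus men), pooling all job levels.
overall_gap = mean_pay("women") - mean_pay("men")

# Gap after stratifying by job level.
gap_by_level = {
    lvl: mean_pay("women", lvl) - mean_pay("men", lvl)
    for lvl in ("junior", "senior")
}

print("overall gap:", overall_gap)
print("gap by level:", gap_by_level)
```

In this toy table the pooled gap is negative (women earn less overall), while the gap within each job level is positive. Which of the two numbers answers the question cannot be read off the table itself.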

To find the right answer, we cannot simply look at the data, because there is nothing in it that can tell us how to properly analyze it — no matter how large it is and how finely we can slice it. We need to make causal assumptions! 7/

Variables such as job level and function are likely affected by gender, because we know from prior literature that there are, e.g., child penalties for women and gender-specific occupation choices. This turns them into so-called "post-treatment variables". 8/

At the same time, there might be many determinants of an employee's job level and compensation that even @Google can't observe in their vast data. One prime candidate for such unobservables is personal job-related skills, which we often only have rough proxies for. 9/

But if we now want to estimate the effect of gender on compensation, job level becomes a collider: it is affected by both gender and unobserved skill. If we control for it, by running separate regressions for each job level, we create a spurious association between gender and skill. And because employees with higher skills receive higher salaries, this biases the estimate. 10/

The intuition here is that women have more obstacles to overcome to make it to higher-level positions. Those women who make it nonetheless are often a highly selected group with likely higher-than-average skills. This higher skill level pushes up their annual pay. 11/

So especially in groups with higher seniority you will find women who consistently overperformed throughout their careers to make it this far. It is therefore not surprising that they might also receive, e.g., higher bonuses than their male peers. 12/
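The selection story can be sketched with a small simulation. Everything here is an assumption made for illustration: the variable names, the size of the promotion barrier, and the pay equation are invented, and the pay equation deliberately contains no direct gender effect. Skill is unobserved by the analyst, and the regressions are plain OLS of pay on a gender dummy, run once on everyone and once per job level:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

woman = rng.integers(0, 2, size=n)  # 1 = woman, 0 = man
skill = rng.normal(size=n)          # unobserved by the analyst

# Assumption: women face an extra barrier to reaching the senior level.
barrier = 1.0
luck = rng.normal(scale=0.5, size=n)
senior = (skill - barrier * woman + luck) > 0.5

# Assumption: pay depends on skill and job level only (no direct gender term).
pay = 100 + 10 * skill + 30 * senior + rng.normal(scale=5, size=n)


def gender_coef(mask):
    """OLS coefficient on `woman` from regressing pay on an intercept and woman."""
    X = np.column_stack([np.ones(mask.sum()), woman[mask]])
    coef, *_ = np.linalg.lstsq(X, pay[mask], rcond=None)
    return coef[1]


overall = gender_coef(np.ones(n, dtype=bool))  # all employees pooled
within_senior = gender_coef(senior)            # senior level only
within_junior = gender_coef(~senior)           # junior level only

print(f"pooled:        {overall:+.2f}")
print(f"senior only:   {within_senior:+.2f}")
print(f"junior only:   {within_junior:+.2f}")
```

With these made-up parameters the pooled coefficient comes out negative, because fewer women reach the better-paid level, while both within-level coefficients come out positive, because the women in each level are on average more skilled than the men next to them. A level-by-level analysis of this simulated firm would conclude that men are underpaid.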

More on these causal inference challenges and the dangers of estimating the gender wage gap with sophisticated ML methods without a proper theory behind them can be found in this paper: arxiv.org/abs/2108.11294 Thanks for reading! 13/13
