Sasha Gusev
Feb 21, 2024 20 tweets 9 min read
I've written about race, genetic ancestry, analyses of large biobanks, and human history. I'll summarize the key points here 🧵: gusevlab.org/projects/hsq/#…
Let's define some terms. Race is a social categorization of people into groups, typically based on physical attributes. Genetic ancestry is a quantification of genetic similarity to a reference population. While correlated, they have fundamentally different causes & consequences.
We should care about causes, and race is a poor causal model of human evolution. In truth, genetic variation follows a "nested subsets" model, where all people eventually share ancestors, which is fundamentally different from race (see for yourself here: james-kitchens.com/blog/visualizi…).
Formally, race-like models do not fit well to the genetic distances we observe even in highly geographically distinct populations with minimal admixture [Long et al 2009] (and as we'll see, mixture is very common). Trying to make a racial model work produces nonsense.
Consistent with nested subsets, a pair of individuals from Africa have more genetic differences than a pair from Africa/France, and the majority of those differences are *common* across all populations [Biddanda 2020]. Population-private common variants are very infrequent.
Notably, before genetic data, race advocates predicted that most variation would be homozygous within racial groups and highly divergent. The truth is the opposite even for ancestry: if you condition out population labels you are still left with 85% of the genetic variation.
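
The "85% left over" idea is a statement about partitioning variation within versus between groups. Here is a minimal sketch of one common way to formalize it (the within-population share of heterozygosity, i.e. 1 - F_ST), assuming a single biallelic variant and equally sized populations. The function name and the allele frequencies are made up for illustration and do not come from any dataset or from the papers cited here.

```python
def within_population_share(freqs):
    """Share of total heterozygosity found within populations (1 - F_ST).

    freqs: allele frequencies of one biallelic variant, one value per
    equally sized population (hypothetical values).
    """
    p_bar = sum(freqs) / len(freqs)
    # Heterozygosity if everyone were pooled into one population.
    h_total = 2 * p_bar * (1 - p_bar)
    # Average heterozygosity inside each labeled population.
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return h_within / h_total


# Illustrative frequencies: quite different across three populations.
freqs = [0.30, 0.50, 0.70]
share = within_population_share(freqs)
print(f"within-population share: {share:.3f}")
```

Even with frequencies this different across labels, most of the variation at the variant stays within populations after the labels are accounted for.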
But what about ancestry? Let's go through three core methods for analyzing ancestry in genetic data: dimensionality reduction (PCA), model-based clustering (STRUCTURE), and parametric models (admixture graphs). Because ancestry is relative, each approach has limitations.
1: PCA estimates eigenvectors of the sample relatedness matrix which, under simple ancestry models, are expected to recover population labels. Matrix theory shows that PCA is extremely sensitive with enough data, able to detect relationships down to a handful of generations.
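
To make the PCA step concrete, here is a rough sketch on simulated data, assuming NumPy is available. Two hypothetical populations drift slightly from a shared ancestral frequency, genotypes are standardized per variant, and the eigenvectors of the sample-by-sample relatedness matrix are taken as PCs. All sizes and drift parameters are arbitrary choices for illustration, not values from any real biobank.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 2000

# Hypothetical ancestral frequencies, plus independent drift per population.
p_anc = rng.uniform(0.1, 0.9, n_snps)
p_a = np.clip(p_anc + rng.normal(0, 0.1, n_snps), 0.01, 0.99)
p_b = np.clip(p_anc + rng.normal(0, 0.1, n_snps), 0.01, 0.99)

# Genotypes are 0/1/2 allele counts; rows are individuals, columns are variants.
geno = np.vstack([
    rng.binomial(2, p_a, (n_per_pop, n_snps)),
    rng.binomial(2, p_b, (n_per_pop, n_snps)),
]).astype(float)

# Standardize each variant (dropping monomorphic ones), then build the
# sample-by-sample relatedness matrix.
freq = geno.mean(axis=0) / 2
keep = (freq > 0) & (freq < 1)
z = (geno[:, keep] - 2 * freq[keep]) / np.sqrt(2 * freq[keep] * (1 - freq[keep]))
grm = z @ z.T / z.shape[1]

# Eigenvectors of the relatedness matrix are the principal components.
eigvals, eigvecs = np.linalg.eigh(grm)
pc1 = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

print("mean PC1, population A:", pc1[:n_per_pop].mean())
print("mean PC1, population B:", pc1[n_per_pop:].mean())
```

In this toy setup the first PC splits the two simulated groups without ever seeing their labels, which is the "recover population labels" behavior under a simple ancestry model.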
But PCA is easily distorted by the sampling process: bigger populations will warp the PCA locations, and even individuals within a single family can look like different populations. PCA also produces unusual artifacts when there is simple spatial locality in the data.
2: STRUCTURE clusters individuals as mixtures of a fixed set of (k) populations. It shares strengths with PCA (sensitive, interpretable) but also limitations (distorted by sampling and parameter choice).
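
The "mixtures of k populations" idea can be written down as a likelihood. This is only a sketch of the underlying admixture model for one individual, assuming independent variants and binomial genotypes. The real STRUCTURE software fits a Bayesian version of this by sampling, which is not reproduced here. The function name and all frequencies are hypothetical.

```python
import math


def admixture_log_likelihood(genotypes, q, pop_freqs):
    """Log-likelihood of one individual's genotypes under a K-population mixture.

    genotypes: allele counts (0, 1, or 2), one per variant
    q: ancestry proportions over the K populations (must sum to 1)
    pop_freqs: K lists of per-variant allele frequencies
    """
    assert abs(sum(q) - 1.0) < 1e-9
    total = 0.0
    for j, g in enumerate(genotypes):
        # The individual's expected allele frequency is a q-weighted average
        # of the population frequencies at this variant.
        p = sum(qk * freqs[j] for qk, freqs in zip(q, pop_freqs))
        total += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1 - p)
    return total


# Hypothetical frequencies for K=2 populations at four variants.
pop_freqs = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
genotypes = [2, 2, 0, 0]  # resembles the first population

ll_pop1 = admixture_log_likelihood(genotypes, [1.0, 0.0], pop_freqs)
ll_mixed = admixture_log_likelihood(genotypes, [0.5, 0.5], pop_freqs)
print(ll_pop1, ll_mixed)
```

Note that both k and the set of sampled individuals that define the population frequencies are inputs to the model, which is where the sensitivity to sampling and parameter choice comes in.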
[Lawson 2018] provide many examples where STRUCTURE misses known, uh, structure; merges divergent populations together; finds false admixture; or models ancient genomes as mixtures of modern ones.
3: Admixture Graphs use allele-sharing statistics to fit population drift, splits, and mixtures. A very powerful approach but identifiability is a challenge: is the graph you found significantly better than all other possible graphs? [Maier 2023] show it often is not.
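
For a sense of what "allele-sharing statistics" look like, here is a bare-bones sketch of the f4 and f3 statistics computed from population allele frequencies. It uses the plain definitions and leaves out the finite-sample corrections and standard errors that real tools apply. The helper names and frequencies are hypothetical.

```python
def f4(a, b, c, d):
    """Mean of (a - b) * (c - d) over variants; inputs are allele-frequency lists."""
    return sum((pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)) / len(a)


def f3(target, a, b):
    """Mean of (target - a) * (target - b) over variants.

    Clearly negative values suggest the target is a mixture of
    populations related to a and b.
    """
    return sum((pt - pa) * (pt - pb) for pt, pa, pb in zip(target, a, b)) / len(target)


# Hypothetical allele frequencies at five variants.
pop_a = [0.10, 0.80, 0.30, 0.60, 0.20]
pop_b = [0.70, 0.20, 0.90, 0.10, 0.60]
# A population formed as an exact 50/50 mixture of a and b.
mixed = [(x + y) / 2 for x, y in zip(pop_a, pop_b)]

f3_mixed = f3(mixed, pop_a, pop_b)
f4_self = f4(pop_a, pop_b, pop_a, pop_b)
print(f3_mixed, f4_self)
```

An admixture graph is fit so that its predicted values of statistics like these match the observed ones, and many different graphs can predict nearly the same values, which is the identifiability problem.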
Got all that? Now let's look at some real data from biobanks. Again we see that race is a very poor model of human populations, which are continuous mixtures of multiple data sources with no clear boundaries or mapping to any folk racial constructs.
As predicted, PCA is sensitive. When we zoom in on populations we see clines all the way down: county-level correlations among self-reported British whites; down to neighborhood level correlations in Chinese and Japanese biobanks. Continuous relationships at all scales.
I sometimes see the argument that, even if race is flawed, genetic ancestry can tell us the "true" races. But this is clearly wrong. These methods depend on a sampling process we cannot know, and real data is full of the mixtures and continuous relationships we just saw.
This is even more apparent when looking at populations in history. Dynamic admixture, migration, and continuous structure are historically common and sometimes quite rapid. Geography-ancestry relationships have been rewritten countless times.
Yet again, historic models motivated in part by racial thinking presumed that populations largely evolved through "serial founder" events and developed in isolation. Genetic data shows us this is clearly not the case and our history is much more complicated.
Finally, in Africa, we see the limits of even our sophisticated computational models. Highly complex structure and gene flow can fit models of deep separation or continuous migration equally well, even when including ancient DNA. Our genetic history in Africa remains a mystery!
In sum, we use models to understand the causal processes in our world and race is a very poor causal model for genetics. But even models of genetic ancestry have fundamental limitations in light of our complex and dynamic, nested human history. Much more work to be done!

/fin