Sasha Gusev
Feb 21, 2024, 20 tweets, 9 min read
I've written about race, genetic ancestry, analyses of large biobanks, and human history: gusevlab.org/projects/hsq/#…

I'll summarize the key points here 🧵
Let's define some terms. Race is a social categorization of people into groups, typically based on physical attributes. Genetic ancestry is a quantification of genetic similarity to a reference population. While correlated, they have fundamentally different causes & consequences.
We should care about causes, and race is a poor causal model of human evolution. In truth, genetic variation follows a "nested subsets" model, where all people eventually share ancestors, which is fundamentally different from race (see for yourself here: james-kitchens.com/blog/visualizi…).
Formally, race-like models fit poorly to the genetic distances we observe, even in highly geographically distinct populations with minimal admixture [Long et al 2009] (and as we'll see, mixture is very common). Trying to make a racial model work produces nonsense.
Consistent with nested subsets, a pair of individuals from Africa has more genetic differences than an Africa/France pair, and the majority of those differences are *common* across all populations [Biddanda 2020]. Population-private common variants are very infrequent.
Notably, before genetic data, race advocates predicted that most variation would be homozygous within racial groups and highly divergent. The truth is the opposite even for ancestry: if you condition out population labels you are still left with 85% of the genetic variation.
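To make the "condition out population labels" idea concrete, here is a minimal sketch of the bookkeeping: compare the average heterozygosity within populations to the heterozygosity of the pooled population. The allele frequencies are made up for illustration (they are not real data and will not reproduce the 85% figure), populations are weighted equally as a simplifying assumption, and the function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def within_population_fraction(freqs_by_snp):
    """Share of total heterozygosity that remains within populations.

    freqs_by_snp: one tuple of per-population allele frequencies per SNP.
    Populations are weighted equally (an assumption of this sketch).
    """
    h_within = 0.0
    h_total = 0.0
    for freqs in freqs_by_snp:
        # Average heterozygosity inside each population.
        h_within += sum(heterozygosity(p) for p in freqs) / len(freqs)
        # Heterozygosity after pooling everyone and ignoring labels.
        pooled = sum(freqs) / len(freqs)
        h_total += heterozygosity(pooled)
    return h_within / h_total


# Toy frequencies for three populations at three SNPs (illustrative only).
toy_freqs = [(0.3, 0.5, 0.7), (0.1, 0.2, 0.3), (0.6, 0.6, 0.6)]
fraction = within_population_fraction(toy_freqs)
print(f"within-population share of variation: {fraction:.2f}")
```

Even with the fairly large frequency differences in this toy input, most of the variation stays within populations, because the pooled heterozygosity can only exceed the within-population average by the variance of the frequencies across populations.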
But what about ancestry? Let's go through three core methods for analyzing ancestry in genetic data: dimensionality reduction (PCA), model-based clustering (STRUCTURE), and parametric models (admixture graphs). Because ancestry is relative, each approach has limitations.
1: PCA estimates eigenvectors of the sample relatedness matrix which, under simple ancestry models, are expected to recover population labels. Matrix theory shows that PCA is extremely sensitive with enough data, able to detect relationships down to a handful of generations.
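A minimal sketch of that idea, assuming a toy simulation rather than real genotypes: two populations with strongly different allele frequencies, a mean-centered relatedness matrix, and the top eigenvector found by power iteration. Everything here (function names, sample sizes, the 0.2 vs 0.8 frequencies, skipping the usual per-SNP variance scaling) is an illustrative assumption, not how any particular PCA tool is implemented.

```python
import random


def simulate_genotypes(n_per_pop=10, n_snps=200, seed=0):
    """Toy diploid genotypes (0/1/2) for two populations with very different frequencies."""
    rng = random.Random(seed)
    # Each SNP is common in one population and rare in the other (assumed values).
    freqs = [(0.2, 0.8) if rng.random() < 0.5 else (0.8, 0.2) for _ in range(n_snps)]
    labels = [0] * n_per_pop + [1] * n_per_pop
    geno = [
        [sum(rng.random() < freqs[j][pop] for _ in range(2)) for j in range(n_snps)]
        for pop in labels
    ]
    return geno, labels


def relatedness_matrix(geno):
    """Sample-by-sample covariance of mean-centered genotypes (no variance scaling)."""
    n, m = len(geno), len(geno[0])
    means = [sum(row[j] for row in geno) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in geno]
    return [[sum(x * y for x, y in zip(ri, rj)) / m for rj in centered] for ri in centered]


def top_eigenvector(mat, n_iter=200):
    """Leading eigenvector of a symmetric positive semi-definite matrix via power iteration."""
    n = len(mat)
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        new = [sum(mat[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return vec


geno, labels = simulate_genotypes()
pc1 = top_eigenvector(relatedness_matrix(geno))
for label, score in zip(labels, pc1):
    print(label, round(score, 3))
```

In this deliberately easy setting the sign of PC1 separates the two simulated groups. Real data are rarely this clean, which is where the caveats in the next tweet come in.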
But PCA is easily distorted by the sampling process: bigger populations will warp the PCA locations, and even individuals within a single family can look like different populations. PCA also produces unusual artifacts when there is simple spatial locality in the data.
2: STRUCTURE clusters individuals as mixtures of a fixed set of (k) populations. It shares strengths with PCA (sensitive, interpretable) but also limitations (distorted by sampling and parameter choice).
[Lawson 2018] provide many examples where STRUCTURE misses known, uh, structure; merges divergent populations together; finds false admixture; or models ancient genomes as mixtures of modern ones.
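To illustrate the "individuals as mixtures of k populations" model, here is a heavily simplified sketch with k = 2. It assumes the source allele frequencies are already known and estimates one individual's mixture proportion by a grid search over the binomial likelihood. This is not STRUCTURE's actual algorithm (which infers frequencies and proportions jointly in a Bayesian framework); the function names, frequencies, and genotypes are hypothetical.

```python
import math


def log_likelihood(q, genotypes, freqs_a, freqs_b):
    """Log-likelihood of mixture proportion q (share from population A).

    Each genotype (0/1/2 copies) is modeled as two draws with allele frequency
    q * p_A + (1 - q) * p_B. The binomial coefficient is constant in q and is
    dropped. Frequencies must lie strictly between 0 and 1.
    """
    total = 0.0
    for g, pa, pb in zip(genotypes, freqs_a, freqs_b):
        f = q * pa + (1.0 - q) * pb
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


def estimate_admixture(genotypes, freqs_a, freqs_b, steps=100):
    """Grid-search maximum-likelihood estimate of q on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freqs_a, freqs_b))


# Assumed source frequencies at four SNPs (illustrative only).
freqs_a = [0.9, 0.1, 0.9, 0.1]
freqs_b = [0.1, 0.9, 0.1, 0.9]

q_unmixed = estimate_admixture([2, 0, 2, 0], freqs_a, freqs_b)
q_mixed = estimate_admixture([1, 1, 1, 1], freqs_a, freqs_b)
print(q_unmixed, q_mixed)
```

Note that the answer is entirely relative to the source populations you supply: swap in different reference frequencies and the same genome gets a different "mixture", which is one reason the sampling and parameter choices matter so much.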

3: Admixture Graphs use allele-sharing statistics to fit population drift, splits, and mixtures. A very powerful approach but identifiability is a challenge: is the graph you found significantly better than all other possible graphs? [Maier 2023] show it often is not.
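As a small example of an allele-sharing statistic, here is a sketch of f3(C; A, B), the average over SNPs of (c - a)(c - b) where a, b, c are allele frequencies. A negative value is a classic signal that C is a mixture of populations related to A and B. The sketch uses made-up population frequencies directly and skips the finite-sample bias correction and standard errors that real analyses need; the names are hypothetical, and fitting a full graph involves much more than this.

```python
def f3(target, source_1, source_2):
    """f3(target; source_1, source_2): mean over SNPs of (c - a) * (c - b)."""
    products = [(c - a) * (c - b) for c, a, b in zip(target, source_1, source_2)]
    return sum(products) / len(products)


# Toy allele frequencies at four SNPs (illustrative only).
pop_a = [0.1, 0.8, 0.3, 0.6]
pop_b = [0.7, 0.2, 0.9, 0.4]
# A population formed as an even mixture of A and B, with no drift afterwards.
mixed = [0.5 * a + 0.5 * b for a, b in zip(pop_a, pop_b)]

f3_mixed = f3(mixed, pop_a, pop_b)   # mixed population as the target
f3_source = f3(pop_a, pop_b, mixed)  # a source population as the target
print(f3_mixed, f3_source)
```

The mixed population gives a negative f3 and the source population a positive one. The converse does not hold: a non-negative f3 does not rule out admixture, since drift after mixing can mask the signal.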
Got all that? Now let's look at some real data from biobanks. Again we see that race is a very poor model of human populations, which are continuous mixtures of multiple data sources with no clear boundaries or mapping to any folk racial constructs.
As predicted, PCA is sensitive. When we zoom in on populations we see clines all the way down: county-level correlations among self-reported British whites; down to neighborhood level correlations in Chinese and Japanese biobanks. Continuous relationships at all scales.
I sometimes see the argument that, even if race is flawed, genetic ancestry can tell us the "true" races. But this is clearly wrong. These methods depend on a sampling process we cannot know, and real data is full of the mixtures and continuous relationships we just saw.
This is even more apparent when looking at populations in history. Dynamic admixture, migration, and continuous structure are historically common and sometimes quite rapid. Geography-ancestry relationships have been rewritten countless times.
Yet again, historical models motivated in part by racial thinking presumed that populations largely evolved through "serial founder" events and developed in isolation. Genetic data shows us this is clearly not the case and our history is much more complicated.
Finally, in Africa, we see the limits of even our sophisticated computational models. Highly complex structure and gene flow can fit models of deep separation or continuous migration equally well, even when including ancient DNA. Our genetic history in Africa remains a mystery!
In sum, we use models to understand the causal processes in our world and race is a very poor causal model for genetics. But even models of genetic ancestry have fundamental limitations in light of our complex and dynamic, nested human history. Much more work to be done!

/fin