Let's define some terms. Race is a social categorization of people into groups, typically based on physical attributes. Genetic ancestry is a quantification of genetic similarity to a reference population. While correlated, they have fundamentally different causes & consequences.
We should care about causes, and race is a poor causal model of human evolution. In truth, genetic variation follows a "nested subsets" model, where all people eventually share ancestors, which is fundamentally different from race (see for yourself here: james-kitchens.com/blog/visualizi…).
Formally, race-like models do not fit well to the genetic distances we observe even in highly geographically distinct populations with minimal admixture [Long et al 2009] (and as we'll see, mixture is very common). Trying to make a racial model work produces nonsense.
Consistent with nested subsets, a pair of individuals from Africa have more genetic differences than an Africa/France pair, and the majority of those differences are *common* across all populations [Biddanda 2020]. Population-private common variants are very infrequent.
Notably, before genetic data, race advocates predicted that most variation would be fixed (homozygous) within racial groups and highly divergent between them. The truth is the opposite, even for ancestry: condition out population labels and you are still left with ~85% of the genetic variation within groups.
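(If you want to see where a figure like ~85% comes from, here's a minimal toy sketch of partitioning heterozygosity within vs. between populations; the frequencies are simulated under a Balding-Nichols model, not real data.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical allele frequencies at 1,000 loci for 3 populations, drawn so
# the populations are only mildly differentiated (Fst ~ 0.1) under a
# Balding-Nichols model.
global_freq = rng.uniform(0.05, 0.95, size=1000)
fst = 0.10
a = global_freq * (1 - fst) / fst
b = (1 - global_freq) * (1 - fst) / fst
pop_freqs = rng.beta(a, b, size=(3, 1000))

# Expected heterozygosity within populations vs. in the pooled total.
h_within = (2 * pop_freqs * (1 - pop_freqs)).mean()
p_total = pop_freqs.mean(axis=0)
h_total = (2 * p_total * (1 - p_total)).mean()

print(f"fraction of diversity found within populations: {h_within / h_total:.2f}")
# With Fst ~ 0.1 this is ~0.9: most variation remains even after
# conditioning on population labels.
```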
But what about ancestry? Let's go through three core methods for analyzing ancestry in genetic data: dimensionality reduction (PCA), model-based clustering (STRUCTURE), and parametric models (admixture graphs). Because ancestry is relative, each approach has limitations.
1: PCA estimates eigenvectors of the sample relatedness matrix which, under simple ancestry models, are expected to recover population labels. Matrix theory shows that PCA is extremely sensitive with enough data, able to detect relationships down to a handful of generations.
But PCA is easily distorted by the sampling process: bigger populations will warp the PCA locations, and even individuals within a single family can look like different populations. PCA also produces unusual artifacts when there is simple spatial locality in the data.
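For the curious, here's a rough sketch of what ancestry PCA actually computes, on simulated genotypes; real pipelines (smartpca, plink, etc.) add LD pruning, relative removal, and other steps, so treat this as illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical populations with slightly different allele frequencies.
n_per_pop, n_snps = 50, 2000
p1 = rng.uniform(0.1, 0.9, n_snps)
p2 = np.clip(p1 + rng.normal(0, 0.05, n_snps), 0.01, 0.99)
G = np.vstack([rng.binomial(2, p1, (n_per_pop, n_snps)),
               rng.binomial(2, p2, (n_per_pop, n_snps))]).astype(float)

# Standardize each SNP, build the sample relatedness (GRM) matrix,
# and take its top eigenvectors; that's the PCA.
p_hat = np.clip(G.mean(axis=0) / 2, 1e-3, 1 - 1e-3)
X = (G - 2 * p_hat) / np.sqrt(2 * p_hat * (1 - p_hat))
grm = X @ X.T / n_snps
evals, evecs = np.linalg.eigh(grm)   # eigenvalues sorted ascending
pc1 = evecs[:, -1]                   # top principal component

# In this balanced toy, PC1 separates the two groups; with unequal sample
# sizes or close relatives in the data, the projection gets distorted.
print(pc1[:n_per_pop].mean(), pc1[n_per_pop:].mean())
```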
2: STRUCTURE clusters individuals as mixtures of a fixed number (k) of populations. It shares strengths with PCA (sensitive, interpretable) but also limitations (distorted by sampling and parameter choices).
[Lawson 2018] provide many examples where STRUCTURE misses known, uh, structure; merges divergent populations together; finds false admixture; or models ancient genomes as mixtures of modern ones.
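Under the hood, STRUCTURE-style models treat each genotype as a binomial draw from an individual-specific mixture of population allele frequencies. A toy sketch of that likelihood (just evaluating it for given Q and P; real software estimates both):

```python
import numpy as np
from scipy.stats import binom

def admixture_loglik(G, Q, P):
    """G: (n, m) genotypes in {0,1,2}; Q: (n, k) ancestry fractions per
    individual (rows sum to 1); P: (k, m) allele frequencies per population."""
    mix_freq = Q @ P                       # expected allele frequency per genotype
    return binom.logpmf(G, 2, mix_freq).sum()

# Toy data simulated from the model itself, with k = 2 source populations.
rng = np.random.default_rng(2)
n, m, k = 20, 500, 2
P = rng.uniform(0.05, 0.95, (k, m))
Q = rng.dirichlet(np.ones(k), size=n)
G = rng.binomial(2, Q @ P)
print(admixture_loglik(G, Q, P))
# Caveat from the thread: the fit depends on the choice of k and on who was
# sampled, so a good-looking Q matrix does not certify "real" populations.
```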
3: Admixture Graphs use allele-sharing statistics to fit population drift, splits, and mixtures. A very powerful approach, but identifiability is a challenge: is the graph you found significantly better than all other possible graphs? [Maier 2023] show it often is not.
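These graphs are fit to allele-sharing statistics like f4. As a rough illustration (simulated frequencies, not real data), f4(A,B;C,D) is just the average product of allele-frequency differences, and fitting a graph means matching many such statistics at once:

```python
import numpy as np

def f4(pA, pB, pC, pD):
    """f4(A,B;C,D): mean over SNPs of (pA - pB) * (pC - pD)."""
    return np.mean((pA - pB) * (pC - pD))

rng = np.random.default_rng(3)
n_snps = 5000
anc = rng.uniform(0.1, 0.9, n_snps)                      # ancestral frequencies
drift = lambda p, sd: np.clip(p + rng.normal(0, sd, n_snps), 0.01, 0.99)
pA, pB, pC, pD = (drift(anc, 0.05) for _ in range(4))    # independent drift

# With no shared branch connecting {A,B} to {C,D}, f4 is expected to be ~0;
# admixture or shared drift pushes it away from zero.
print(f4(pA, pB, pC, pD))
```

The identifiability problem raised by [Maier 2023] is that many different graphs can reproduce the same set of statistics about equally well.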
Got all that? Now let's look at some real data from biobanks. Again we see that race is a very poor model of human populations, which are continuous mixtures of multiple ancestral sources with no clear boundaries or mapping to any folk racial constructs.
As predicted, PCA is sensitive. When we zoom in on populations we see clines all the way down: county-level correlations among self-reported British whites; down to neighborhood level correlations in Chinese and Japanese biobanks. Continuous relationships at all scales.
I sometimes see the argument that, even if race is flawed, genetic ancestry can tell us the "true" races. But this is clearly wrong. These methods depend on a sampling process we cannot know, and real data is full of the mixtures and continuous relationships we just saw.
This is even more apparent when looking at populations in history. Dynamic admixture, migration, and continuous structure are historically common and sometimes quite rapid. Geography-ancestry relationships have been rewritten countless times.
Yet again, historic models motivated in part by racial thinking presumed that populations largely evolved through "serial founder" events and developed in isolation. Genetic data shows us this is clearly not the case and our history is much more complicated.
Finally, in Africa, we see the limits of even our sophisticated computational models. Highly complex structure and gene flow can fit models of deep separation or continuous migration equally well, even when including ancient DNA. Our genetic history in Africa remains a mystery!
In sum, we use models to understand the causal processes in our world, and race is a very poor causal model for genetics. But even models of genetic ancestry have fundamental limitations in light of our complex, dynamic, and nested human history. Much more work to be done!
/fin
I wrote a little bit about the "missing heritability" question and several recent studies that have brought it to a close. A short 🧵
For nearly two decades the field has been asking why heritability estimates from molecular studies are so far below estimates from twin studies (nature.com/news/2008/0811…). Are molecular studies missing important genetic variation or are twin studies biased by strong assumptions?
Recently, multiple innovative methods have been developed to estimate "narrow-sense" heritability directly from genetic data. These methods make varying assumptions about environments and interactions, and thus allow us to triangulate on the true parameter.
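One of the simplest members of this family is Haseman-Elston regression: regress pairwise phenotypic similarity on pairwise genetic relatedness among (nominally) unrelated people, and the slope estimates SNP heritability. A toy simulation of the idea, not any specific published pipeline:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, h2_true = 1000, 1000, 0.5

# Simulate unrelated genotypes and a purely additive phenotype.
G = rng.binomial(2, 0.5, (n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
beta = rng.normal(0, np.sqrt(h2_true / m), m)
y = X @ beta + rng.normal(0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()

# GRM, then regress phenotype products on off-diagonal relatedness values.
A = X @ X.T / m
iu = np.triu_indices(n, k=1)
slope = np.polyfit(A[iu], (y[:, None] * y[None, :])[iu], 1)[0]
print(f"Haseman-Elston h2 estimate: {slope:.2f} (simulated truth {h2_true})")
```

Different estimators in this family make different assumptions about environments and interactions, which is exactly why comparing them lets us triangulate.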
Eric Turkheimer has a good piece about a bet he made with Charles Murray regarding the genetic understanding of IQ (or, really, the lack of it). Murray being so wrong in his prediction should make us question his world model, but it's also worth commenting on his response.
Murray has, for some time now, been workshopping the excuse that progress on IQ genetics was blocked by researchers being denied access to the relevant databases. This is patently untrue!
First, one of the largest genetic analyses to date of *any* trait is of educational attainment, a phenotype Murray himself has used as a proxy for intelligence. Surely a study of 3 million people should have been enough to satisfy Murray's prediction.
Murray and most of race twitter has apparently been fooled by this completely fabricated analysis purporting to show African ancestry is associated with IQ. People lie on twitter all the time, but this is both more revealing and more disturbing than usual. A 🧵
Revealing in that it shows how quantitative racism is just an exercise in manipulating data to fit the preconceived conclusion. Disturbing because this time private data is being used and the results, which cannot be easily verified, are just flatly invented.
What's actually going on? Some guy claims to have an analysis showing that African ancestry differences between siblings are associated with IQ differences in the UK Biobank, the implication being that ancestry contributes to the within-family (direct) influences on IQ.
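Purely to clarify the design being claimed (not an endorsement, and obviously not the fabricated data): a within-sibling analysis regresses sibling differences in the trait on sibling differences in the predictor, which removes anything shared within the family. Something like this, with hypothetical column names and simulated placeholder data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical sibling pairs in which ancestry differences have NO effect
# on the trait; everything here is a placeholder, not UK Biobank data.
rng = np.random.default_rng(5)
n_pairs = 1000
df = pd.DataFrame({
    "ancestry_sib1": rng.normal(0, 1, n_pairs),
    "ancestry_sib2": rng.normal(0, 1, n_pairs),
    "iq_sib1": rng.normal(100, 15, n_pairs),
    "iq_sib2": rng.normal(100, 15, n_pairs),
})

# Regress the sibling difference in IQ on the sibling difference in ancestry;
# differencing removes shared family background.
d_iq = df["iq_sib1"] - df["iq_sib2"]
d_anc = df["ancestry_sib1"] - df["ancestry_sib2"]
fit = sm.OLS(d_iq, sm.add_constant(d_anc)).fit()
print(fit.params, fit.pvalues)   # slope ~0 here, as simulated
```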
A few thoughts on Herasight, the new embryo selection company. First, the post below and the white paper imply that competitors like Nucleus have been marketing and selling grossly erroneous risk estimates. This is shocking if true! 🧵
I wrote last year about the un-seriousness with which Nucleus approached their IQ product and the damage it could do to genetic prediction and research more broadly (theinfinitesimal.substack.com/p/genomic-pred…). This appears to have been a broader pattern beyond IQ, extending even to rare disease.
People who care about this technology should be furious at Nucleus and their collaborators (as well as Orchid and Genomic Prediction for their own errors). Finding such flaws should not require reverse-engineering by a competitor. These products clearly need independent audits.
Oof. Polygenic scores for IQ lose 75% of their explained variance when adding family controls, even worse than the attenuation for Educational Attainment. These are the scores Silicon Valley is using to select embryos 😬.
The TEDS cohort used here is a very large study with high-quality cognitive assessments collected over multiple time points. It is probably the most impressive twin study of IQ to date. That means very little room for data quality / measurement error issues.
It is important to highlight surprising null results. Just last week we were hypothesizing that large IQ score attenuation could be a study bias or an artifact of the Wilson Effect. Now we see it replicate in an independent study with adults.
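A toy illustration (hypothetical numbers, not the TEDS analysis) of why family controls shrink a score's explained variance: if part of the score-trait association runs through family-level confounding, it shows up between families but is differenced out within them.

```python
import numpy as np

rng = np.random.default_rng(6)
n_fam = 5000

# Two siblings per family; their polygenic scores share a family component.
pgs_fam = rng.normal(0, 1, n_fam)
pgs = pgs_fam[:, None] + rng.normal(0, 1, (n_fam, 2))

direct = 0.2     # direct effect of the score (hypothetical)
confound = 0.4   # family-level confounding correlated with the score (hypothetical)
trait = direct * pgs + confound * pgs_fam[:, None] + rng.normal(0, 1, (n_fam, 2))

# Population estimate: correlate score and trait across all individuals.
r_pop = np.corrcoef(pgs.ravel(), trait.ravel())[0, 1]
# Within-family estimate: correlate sibling differences (family effects cancel).
r_within = np.corrcoef(pgs[:, 0] - pgs[:, 1], trait[:, 0] - trait[:, 1])[0, 1]

print(f"population R2 = {r_pop**2:.2f}, within-family R2 = {r_within**2:.2f}")
# The within-family R2 keeps only the direct effect, so the explained
# variance drops sharply, analogous to the attenuation seen for IQ scores.
```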
@notcomplex_ @krichard1212 The authors fit a non-identifiable Model B, which produces a table full of NA's. Then they try to interpret this model in order to fix it. That makes no sense: the parameters of this model are completely arbitrary, so using it to decide what to prune is also statistically invalid.
@notcomplex_ @krichard1212 At various points later on they talk about "Heywood cases", which are out-of-bounds parameters or negative variances, but no such out-of-bounds parameters are actually present in the tables (and, again, you cannot interpret these from the non-identified model).
@notcomplex_ @krichard1212 So none of the decisions make statistical sense, and they either reflect someone who doesn't know what they're doing or someone intentionally trying to find the model fit they like. True to form, given they missed a fatal error with Model A, misinterpreted AIC comparisons, etc.