Let's define some terms. Race is a social categorization of people into groups, typically based on physical attributes. Genetic ancestry is a quantification of genetic similarity to a reference population. While correlated, they have fundamentally different causes & consequences.
We should care about causes, and race is a poor causal model of human evolution. In truth, genetic variation follows a "nested subsets" model, where all people eventually share ancestors, which is fundamentally different from race (see for yourself here: james-kitchens.com/blog/visualizi…).
Formally, race-like models do not fit well to the genetic distances we observe even in highly geographically distinct populations with minimal admixture [Long et al 2009] (and as we'll see, mixture is very common). Trying to make a racial model work produces nonsense.
Consistent with nested subsets, a pair of individuals from Africa have more genetic differences than a pair from Africa/France, and the majority of those differences are *common* across all populations [Biddanda 2020]. Population-private common variants are very infrequent.
Notably, before genetic data, race advocates predicted that most variation would be homozygous within racial groups and highly divergent. The truth is the opposite even for ancestry: if you condition out population labels you are still left with 85% of the genetic variation.
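To make "conditioning out population labels" concrete, here is a toy sketch for a single variant. The allele frequencies are made up for illustration (they are not from any dataset), populations are assumed to be equal-sized, and this is the simple heterozygosity-based partition rather than the exact analysis behind the 85% figure.

```python
def within_population_fraction(freqs):
    """Share of expected heterozygosity that remains inside populations.

    freqs: allele frequency of one biallelic variant in each population
    (equal-sized populations assumed for simplicity).
    """
    # Average heterozygosity within each population: 2p(1-p).
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    # Heterozygosity of the pooled population, using the mean frequency.
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)
    return h_within / h_total

# Hypothetical frequencies for one variant in three populations.
example_freqs = [0.3, 0.5, 0.7]
fraction_within = within_population_fraction(example_freqs)
print(f"{fraction_within:.3f} of the variation is within populations")
```

Even with frequencies that differ this much between groups, most of the variation is still there after removing the labels. One minus this fraction is what is usually reported as F_ST.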
But what about ancestry? Let's go through three core methods for analyzing ancestry in genetic data: dimensionality reduction (PCA), model-based clustering (STRUCTURE), and parametric models (admixture graphs). Because ancestry is relative, each approach has limitations.
1: PCA estimates eigenvectors of the sample relatedness matrix which, under simple ancestry models, are expected to recover population labels. Matrix theory shows that PCA is extremely sensitive with enough data, able to detect relationships down to a handful of generations.
But PCA is easily distorted by the sampling process: bigger populations will warp the PCA locations, and even individuals within a single family can look like different populations. PCA also produces unusual artifacts when there is simple spatial locality in the data.
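Here is a minimal sketch of the "simple ancestry model" case, where the first PC of the sample relatedness matrix recovers two population labels. Everything in it is an assumption for illustration: the allele frequencies, the sample sizes, and the use of power iteration in place of a full eigendecomposition (to keep it dependency-free). Real pipelines also normalize variants and prune for linkage, which this skips.

```python
import random

def simulate_genotypes(freqs_by_pop, n_per_pop, rng):
    """Draw diploid genotypes (allele counts 0/1/2) for each population."""
    genotypes, labels = [], []
    for pop, freqs in enumerate(freqs_by_pop):
        for _ in range(n_per_pop):
            genotypes.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            labels.append(pop)
    return genotypes, labels

def first_pc(genotypes, rng, iters=100):
    """Leading eigenvector of the sample-by-sample relatedness matrix."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each variant by its mean allele count across samples.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[g - mu for g, mu in zip(row, means)] for row in genotypes]
    # Relatedness between every pair of samples (average product over variants).
    related = [[sum(a * b for a, b in zip(r1, r2)) / m for r2 in centered]
               for r1 in centered]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        v = [sum(k * x for k, x in zip(row, v)) for row in related]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

rng = random.Random(0)
n_variants = 200
# Hypothetical frequencies: half the variants are common in population 0, half in population 1.
freqs_pop0 = [0.2 if j % 2 else 0.8 for j in range(n_variants)]
freqs_pop1 = [0.8 if j % 2 else 0.2 for j in range(n_variants)]
genotypes, labels = simulate_genotypes([freqs_pop0, freqs_pop1], 20, rng)
pc1 = first_pc(genotypes, rng)
for pop in (0, 1):
    scores = [s for s, l in zip(pc1, labels) if l == pop]
    print(pop, round(min(scores), 3), round(max(scores), 3))
```

The two groups land on opposite sides of PC1 only because the simulation was built that way: two discrete, equally sampled populations. Change the sampling and the picture changes too.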
2: STRUCTURE clusters individuals as mixtures of a fixed set of (k) populations. It shares strengths with PCA (sensitive, interpretable) but also limitations (distorted by sampling and parameter choice).
[Lawson 2018] provide many examples where STRUCTURE misses known, uh, structure; merges divergent populations together; finds false admixture; or models ancient genomes as mixtures of modern ones.
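The core idea behind this kind of clustering can be sketched in a few lines: an individual's expected allele frequency at each variant is a weighted mix of the source populations' frequencies. This is NOT the STRUCTURE algorithm itself, which has to infer the source frequencies and the memberships jointly. Here I assume k=2, treat the (hypothetical) source frequencies as known, and estimate one individual's mixture fraction by a plain grid search over the likelihood.

```python
import math
import random

def log_likelihood(q, genotype, freqs_a, freqs_b):
    """Log-likelihood of one genotype if a fraction q of ancestry comes from source A."""
    total = 0.0
    for g, pa, pb in zip(genotype, freqs_a, freqs_b):
        p = q * pa + (1 - q) * pb  # expected allele frequency for this individual
        # Binomial(2, p) likelihood; the constant binomial coefficient is dropped.
        total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total

def estimate_q(genotype, freqs_a, freqs_b, grid_size=101):
    """Grid-search maximum-likelihood estimate of the ancestry fraction q."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda q: log_likelihood(q, genotype, freqs_a, freqs_b))

rng = random.Random(1)
n_variants = 2000
# Hypothetical, strongly differentiated source frequencies (assumed known).
freqs_a = [0.9 if j % 2 else 0.1 for j in range(n_variants)]
freqs_b = [0.1 if j % 2 else 0.9 for j in range(n_variants)]

# Simulate one individual with 30% of ancestry from source A.
true_q = 0.3
genotype = []
for pa, pb in zip(freqs_a, freqs_b):
    p = true_q * pa + (1 - true_q) * pb
    genotype.append((rng.random() < p) + (rng.random() < p))

q_hat = estimate_q(genotype, freqs_a, freqs_b)
print(f"true q = {true_q}, estimated q = {q_hat:.2f}")
```

The estimate is only meaningful relative to the two sources it was given. Swap in different reference frequencies or a different k and the same genotype gets a different "ancestry", which is the sensitivity to sampling and parameter choice described above.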
3: Admixture Graphs use allele-sharing statistics to fit population drift, splits, and mixtures. A very powerful approach but identifiability is a challenge: is the graph you found significantly better than all other possible graphs? [Maier 2023] show it often is not.
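As a rough illustration of an allele-sharing statistic, here is a sketch of the f3 statistic computed on made-up population allele frequencies. It assumes exact frequencies and no drift after mixing; real analyses estimate these from finite samples with bias corrections and standard errors, which this skips entirely.

```python
def f3(target, source1, source2):
    """f3(target; source1, source2): average of (c - a) * (c - b) over variants."""
    return sum((c - a) * (c - b)
               for c, a, b in zip(target, source1, source2)) / len(target)

# Hypothetical allele frequencies at four variants in two source populations.
pop_a = [0.1, 0.5, 0.9, 0.3]
pop_b = [0.7, 0.2, 0.4, 0.8]
# A population formed as an even mixture of A and B, with no drift afterwards.
pop_c = [0.5 * a + 0.5 * b for a, b in zip(pop_a, pop_b)]

f3_mixed = f3(pop_c, pop_a, pop_b)    # negative: C sits "between" A and B
f3_unmixed = f3(pop_a, pop_b, pop_c)  # positive: A is not a mixture of B and C
print(f3_mixed, f3_unmixed)
```

A negative value is evidence that the target is a mixture of populations related to the two sources. An admixture graph tries to fit many such statistics at once, which is where the "is this graph really better than the alternatives" problem comes in.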
Got all that? Now let's look at some real data from biobanks. Again we see that race is a very poor model of human populations, which are continuous mixtures of multiple data sources with no clear boundaries or mapping to any folk racial constructs.
As predicted, PCA is sensitive. When we zoom in on populations we see clines all the way down: county-level correlations among self-reported British whites, down to neighborhood-level correlations in Chinese and Japanese biobanks. Continuous relationships at all scales.
I sometimes see the argument that, even if race is flawed, genetic ancestry can tell us the "true" races. But this is clearly wrong. These methods depend on a sampling process we cannot know, and real data is full of the mixtures and continuous relationships we just saw.
This is even more apparent when looking at populations in history. Dynamic admixture, migration, and continuous structure are historically common and sometimes quite rapid. Geography-ancestry relationships have been rewritten countless times.
Yet again, historic models motivated in part by racial thinking presumed that populations largely evolved through "serial founder" events and developed in isolation. Genetic data shows us this is clearly not the case and our history is much more complicated.
Finally, in Africa, we see the limits of even our sophisticated computational models. Highly complex structure and gene flow can fit models of deep separation or continuous migration equally well, even when including ancient DNA. Our genetic history in Africa remains a mystery!
In sum, we use models to understand the causal processes in our world and race is a very poor causal model for genetics. But even models of genetic ancestry have fundamental limitations in light of our complex and dynamic, nested human history. Much more work to be done!
/fin
I've written the first part of a chapter on the heritability of IQ scores, focusing on what IQ is attempting to measure. I highlight multiple paradoxical findings demonstrating IQ is not just "one innate thing".
First, a few reasons to write this. 1) The online IQ discourse is completely deranged. 2) IQists regularly invoke molecular heritability as evidence for classic behavioral genetics findings while ignoring the glaring differences (ex: from books by Ritchie and Haier/Colom/Hunt).
Thus, molecular geneticists have been unwittingly drafted into reifying IQ even though we know that every trait is heritable and behavior is highly environmentally confounded. 3) IQ GWAS have focused on crude factor models that perpetuate the "one intelligence" misconception.
It pains me to see facile critiques of GWAS on here from our clinical/biostats friends while the many actually good reasons to be critical of GWAS get little attention. So here's a thread on what GWAS does, what critics get wrong, and where GWAS is genuinely still lacking. 🧵:
Here’s an example of what I’m talking about from Frank Harrell’s otherwise excellent critique of bad biomarker analysis [fharrell.com/post/badb/]. This gets GWAS completely wrong. Genome-wide significance is not about "picking winners" or "ranking" the losers.
Genome-wide significance is about identifying variants for which the estimated effect size is *accurate*. And since most traits are polygenic (meaning a large fraction of variants will have some non-zero association) this practically means getting effect *direction* right.
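A quick back-of-the-envelope sketch of the "direction" point. The 5e-8 threshold is the conventional one and is my assumption here (it is not stated above), as is the model that a variant's estimated z-score is Normal(true z, 1). Under those assumptions, we can ask how often a variant that clears the threshold has the wrong sign.

```python
import math
from statistics import NormalDist

# Conventional genome-wide threshold (an assumption; not stated in the thread).
alpha = 5e-8
z_threshold = NormalDist().inv_cdf(1 - alpha / 2)

def upper_tail(z):
    """P(Z > z) for a standard normal, accurate far into the tail."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def wrong_sign_given_significant(true_z):
    """P(estimated direction is wrong | variant passes the threshold).

    true_z is the true effect divided by its standard error (positive here);
    the estimated z-score is assumed to be Normal(true_z, 1).
    """
    p_sig_wrong = upper_tail(z_threshold + true_z)  # significant, negative sign
    p_sig_right = upper_tail(z_threshold - true_z)  # significant, positive sign
    return p_sig_wrong / (p_sig_wrong + p_sig_right)

print(f"z threshold: {z_threshold:.2f}")
for true_z in (1.0, 3.0, 5.0):
    print(true_z, wrong_sign_given_significant(true_z))
```

Even for weak true effects, a variant that reaches significance almost always has the right direction in this toy model. It says nothing about confounding, which is a separate problem.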
I’ve seen critiques of the poor methodology and cherry-picking in The Bell Curve but I haven’t seen much about the absolutely deranged fever dream of predictions about the coming decades in its closing chapters. It has been 30 years, so let's review. 🧵:
Low skill labor will become worthless, attempts to increase the minimum wage will backfire. In the not-too-distant future, people with low IQ will be a ”net drag” on society.
“Cognitive resources” in the inner city have already fallen “below the minimum level” and will escalate into a “fundamental breakdown in social organization”. “The Underclass” will become isolated and increasingly unable to function in the larger society.
Unpopular opinion (just look at the QT's) but nearly every "dogmatic, outdated, and misleading" claim about IQ listed here is either objectively accurate or heavily debated within the field itself.
One way test bias is evaluated within the field is by testing for strong measurement invariance (i.e. that subtest behavior is consistent across groups). This method is almost never applied in the classic literature or applied poorly (MCV).
When MI is tested for, it fails often enough that test bias should be the first concern when doing any group comparisons [see Dolan et al. for some examples: …ltewichertsdotnet.files.wordpress.com/2015/12/dolans…]. Test makers work hard to mitigate bias but intelligence researchers often do not.
Some thoughts on the ability to distinguish populations with genetic variation, why that means little for trait differences, and why there are other good reasons to collect diverse data. 🧵
I was pleasantly surprised to see no one mount a strong defense of "biological race" in this thread. Even the people throwing this term around seem to realize it's not supported by data. Instead the conversation shifts to population "distinguishability".
For example, a random twitterer (left) and a professor (right) emphasizing that genetic variation can be used to "distinguish" populations. And it's true, one can aggregate small per-variant differences into genetic ancestry estimates that often correlate highly with geography.
Something I don't want to get lost is that the field is much better now at studying, visualizing, and discussing complex populations than it has ever been, and there are many resources to help do this effectively. A few suggestions below:
The NAES report and interactive on using population descriptors [] and Coop on genetic similarity [].