A tweet thread on the difference between statistical modelling, statistical prediction and prediction of experiments - very relevant for human genetics here and now. These are 3 closely related processes, but they are distinct
Statistical modelling is when we take observational data (say, UK BioBank), come up with a model (eg, a linear relationship between the number of alleles - 0, 1 or 2 - at a small proportion of biallelic SNPs and the phenotype) and do some sort of inference to fit our model to the data
By inference I mean we've got some variables (eg the relationship of allelic dose to a phenotype) that we don't know before seeing the data, and we estimate them. By long tradition we give the unknowns Greek letters (beta is a favourite letter here - no strong reasons why)
If we're lucky we can not only say "ooh, we've fitted our model" but also verify that some assumptions of our model are consistent with the data (eg, not only are most betas very close to 0, and probably "are" 0 in our model, but the statistics of these look "random")
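As a minimal sketch of this kind of fitting (on simulated data, not UK BioBank), here is a per-SNP least-squares estimate of beta. The sample size, allele frequency, effect sizes and the helper name `fit_single_snp` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 5000, 50

# Allele counts (0, 1 or 2) per person per SNP, simulated with allele frequency 0.3
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)

# Made-up truth: only SNP 0 and SNP 1 have non-zero effects
true_beta = np.zeros(n_snps)
true_beta[0], true_beta[1] = 0.5, -0.3
phenotype = genotypes @ true_beta + rng.normal(0.0, 1.0, n_people)

def fit_single_snp(dose, y):
    """Least-squares slope of phenotype on allelic dose (intercept handled by centring)."""
    x = dose - dose.mean()
    return float(x @ (y - y.mean()) / (x @ x))

# One beta per SNP, each fitted on its own
beta_hat = np.array([fit_single_snp(genotypes[:, j], phenotype) for j in range(n_snps)])
print(np.round(beta_hat[:5], 2))
```

In this toy set-up the SNPs are simulated independently, so the two non-zero betas are recovered cleanly and the rest sit near 0; real data is messier.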
This though doesn't tell us our model is right - indeed, *no* model is likely to be right. A more realistic question is "is our model useful" (as in the famous "all models are wrong, some models are useful" paradigm).
In human (and animal, and plant) genetics, done right, the non-zero betas point to (some of) the molecular biology behind the phenotype. A lot of the non-zero betas are a bit head scratching, but biology is super complex, so no surprise. Some are not, and some are tantalizingly interesting
But this is statistical *modelling*, not statistical prediction. In prediction one wants to be able to predict a phenotype of an unseen person (or animal) with confidence. To do this we also have to set up a model - and a similar, but not identical, model is often used
The similar model posits that genetic effects simply sum together, both as allelic dosage at a location and across loci (we need to predict the overall person, not each locus). It's the easiest model for genetics one can imagine writing down
One could fit this model locus by locus, using the same inference as for discovery, but that doesn't leverage the fact that there are usually way more loci than patterns of alleles across loci - due to pedigrees (humans), breeding structure (farm animals + plants) and biology
For farm animal and plant breeding, the pedigrees are so tight and folded back on themselves that the pedigree dominates this lack of diversity; for humans it is both our insanely rapid expansion from Africa and the biological habit of only "switching" at certain points
This switching (recombination) happens in sperm and eggs and is why we are a shuffle of our parents' genomes - or more accurately, a shuffle of our grandparents' genomes. The Chromosome 1 which I got from my dad is a shuffle of my grandfather's and my grandmother's Chromosome 1
This lack of statistical independence between our loci, due to pedigree (farm animals) or population expansion (humans) plus the biology (specific switch points - recombination hotspots), goes by the long name "linkage disequilibrium", abbreviated "LD"
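Written down, the additive model is just a weighted sum of allele counts. A small sketch, where the function name `additive_score` and the effect sizes are hypothetical:

```python
import numpy as np

def additive_score(allele_counts, effects):
    """Sum of (allele count x per-allele effect) across loci, one score per person."""
    allele_counts = np.asarray(allele_counts, dtype=float)
    effects = np.asarray(effects, dtype=float)
    return allele_counts @ effects

# Three loci with made-up per-allele effects
effects = np.array([0.5, -0.25, 0.0])

# Rows are people, columns are loci, entries are allele counts (0, 1 or 2)
people = np.array([
    [0, 1, 2],
    [2, 2, 0],
    [1, 0, 1],
])

scores = additive_score(people, effects)
print(scores)
```

Each allele contributes the same amount whatever else is in the genome, so there are no dominance or interaction terms in this sketch.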
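One common way to put a number on LD is r^2, the squared correlation between alleles at two loci. The sketch below uses toy haplotypes in which each SNP copies its left neighbour 90% of the time; that copying rule is a crude stand-in for low recombination that I have assumed for illustration, not a population genetics model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_haplotypes, n_snps = 4000, 6

# Toy haplotypes: SNP j copies SNP j-1 with probability 0.9, else it is drawn afresh
haps = np.empty((n_haplotypes, n_snps), dtype=int)
haps[:, 0] = rng.binomial(1, 0.5, n_haplotypes)
for j in range(1, n_snps):
    copy = rng.random(n_haplotypes) < 0.9
    fresh = rng.binomial(1, 0.5, n_haplotypes)
    haps[:, j] = np.where(copy, haps[:, j - 1], fresh)

# r^2 for every pair of SNPs: squared Pearson correlation of the 0/1 alleles
r2 = np.corrcoef(haps, rowvar=False) ** 2
print(np.round(r2[0], 2))  # LD between SNP 0 and each SNP, decaying with distance
```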
Given this is a big signal, in the "prediction" task this structure gets in the way of clean inference of neighbouring betas. One approach is to carefully work out which SNP is the "right" SNP (this is called "fine mapping"), but you need even more samples to do this
The alternative is to exploit this structure, which lets one predict an individual well without having to know each beta correctly. (Confusingly though we often call the relevant prediction variables beta. We shouldn't as it confuses people, but we do)
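One way to illustrate the idea is to fit all SNPs jointly with shrinkage and judge the result only on how well it predicts held-out people. The sketch below uses ridge regression, which is my choice for illustration rather than a method named in the thread; the toy LD simulation, the penalty of 50 and the effect sizes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_snps = 40

def simulate_genotypes(n_people, copy_prob=0.9):
    """Diploid genotypes (0, 1, 2) whose neighbouring SNPs are correlated (toy LD)."""
    haps = np.empty((2 * n_people, n_snps))
    haps[:, 0] = rng.binomial(1, 0.5, 2 * n_people)
    for j in range(1, n_snps):
        copy = rng.random(2 * n_people) < copy_prob
        fresh = rng.binomial(1, 0.5, 2 * n_people)
        haps[:, j] = np.where(copy, haps[:, j - 1], fresh)
    return haps[:n_people] + haps[n_people:]  # pair up haplotypes into people

# Made-up truth: three causal SNPs, everything else has no effect
true_beta = np.zeros(n_snps)
true_beta[[5, 20, 33]] = [0.8, -0.7, 0.6]

x_train, x_test = simulate_genotypes(1500), simulate_genotypes(2000)
y_train = x_train @ true_beta + rng.normal(0.0, 1.0, 1500)
y_test = x_test @ true_beta + rng.normal(0.0, 1.0, 2000)

# Ridge regression: all SNPs fitted jointly, with weights shrunk towards 0
x_mean, y_mean = x_train.mean(axis=0), y_train.mean()
xc = x_train - x_mean
penalty = 50.0  # arbitrary illustrative value
weights = np.linalg.solve(xc.T @ xc + penalty * np.eye(n_snps), xc.T @ (y_train - y_mean))

# Judge the model on unseen people, not on whether each weight is "right"
predicted = y_mean + (x_test - x_mean) @ weights
accuracy = float(np.corrcoef(predicted, y_test)[0, 1])
print(round(accuracy, 2))
```

The fitted `weights` are prediction variables, not clean estimates of each causal beta: shrinkage lets correlated neighbours share the signal.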
This is a good trick - it works really well for farm animals and plants - but it makes our prediction model more fragile. When we use it with humans we've got another "model" question to ask - is this model valid for the unseen person I want to predict?
As LD is sensitive to recombination points (broadly similar in humans worldwide) and to population expansion and population history (by definition *not* similar as we move between human populations), this is a real headache.
Personally I think we've got to bite the bullet if we are going to do prediction - and get enough samples from enough diverse people to nail everything (ideally, fine map all major causal loci).
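A toy version of the headache: suppose the causal SNP is not measured and we predict from a neighbouring "tag" SNP. The two populations, their LD values (0.9 and 0.3) and the effect size below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_population(n, ld):
    """Tag-SNP genotypes and phenotypes; `ld` is the chance a tag allele copies the causal allele."""
    causal_hap = rng.binomial(1, 0.5, size=(n, 2))  # two haplotypes per person
    copy = rng.random((n, 2)) < ld
    tag_hap = np.where(copy, causal_hap, rng.binomial(1, 0.5, size=(n, 2)))
    causal = causal_hap.sum(axis=1).astype(float)
    tag = tag_hap.sum(axis=1).astype(float)
    phenotype = 1.0 * causal + rng.normal(0.0, 1.0, n)  # only the causal SNP matters
    return tag, phenotype

# Fit the tag SNP's slope in a population where tag and causal SNP are in strong LD
tag_a, y_a = simulate_population(5000, ld=0.9)
slope = np.cov(tag_a, y_a)[0, 1] / np.var(tag_a, ddof=1)
intercept = y_a.mean() - slope * tag_a.mean()

def accuracy(tag, y):
    """Correlation between the tag-based prediction and the observed phenotype."""
    return float(np.corrcoef(intercept + slope * tag, y)[0, 1])

# New people from the same population vs people from a population with weaker LD
acc_same = accuracy(*simulate_population(5000, ld=0.9))
acc_other = accuracy(*simulate_population(5000, ld=0.3))
print(round(acc_same, 2), round(acc_other, 2))
```

The biology is identical in both populations (same causal SNP, same effect); only the LD between tag and causal SNP differs, and that alone is enough to degrade the prediction.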
There is another subtle gotcha between discovery and prediction - they are quite different tasks, and how one sets up samples and manipulates data on the way in differs. Eg, for discovery, the fact that UK BioBank is about twice as healthy on average is annoying, but ok
But for prediction it means we've got another existential modelling issue - we *know* even a "random British person" is not the same as a "random person from UK BioBank" (let alone a random French person, a random American or a random Kenyan). Notice - this is not necessarily genetics
Even so, given careful set-up of the problem (proper prediction), making sure we're not stacking the deck (so we have to use all the other easy-to-get predictors, eg, age, sex) and enough diverse samples, we should be able to get to "robust prediction"
(There is a whole separate question of whether robust prediction is clinically useful. I will park this divergent thread for a moment!) But statistical prediction of an individual is *not* the same as predicting the result of an experiment >>
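One way to avoid stacking the deck is to ask how much a genetic score adds on top of the easy predictors, measured on held-out people. A sketch on simulated data, where the effect sizes, the age range and the helper names `design` and `r_squared` are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate(n):
    """Made-up cohort: age, sex, a stand-in genetic score and a phenotype."""
    age = rng.uniform(40, 70, n)
    sex = rng.binomial(1, 0.5, n).astype(float)
    score = rng.normal(0.0, 1.0, n)
    y = 0.03 * age + 0.5 * sex + 0.4 * score + rng.normal(0.0, 1.0, n)
    return age, sex, score, y

def design(*columns):
    """Design matrix with an intercept column followed by the given predictors."""
    return np.column_stack([np.ones(len(columns[0])), *columns])

def r_squared(x_train, y_train, x_test, y_test):
    """Fit least squares on the training set, report variance explained on the test set."""
    coef, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    resid = y_test - x_test @ coef
    return float(1.0 - np.mean(resid ** 2) / np.var(y_test))

age_tr, sex_tr, score_tr, y_tr = simulate(4000)
age_te, sex_te, score_te, y_te = simulate(4000)

# Baseline: age and sex only. Full: age, sex and the genetic score.
r2_base = r_squared(design(age_tr, sex_tr), y_tr, design(age_te, sex_te), y_te)
r2_full = r_squared(design(age_tr, sex_tr, score_tr), y_tr,
                    design(age_te, sex_te, score_te), y_te)
print(round(r2_base, 3), round(r2_full, 3), round(r2_full - r2_base, 3))
```

The number to report is the increment `r2_full - r2_base`, not `r2_full` on its own.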
Thread by Ewan Birney