COVID19 and molecular (mainly genomics + genetics) research. Some big-picture thoughts.
First off, understanding the disease this virus causes is important. It clearly has very variable effects in different people: strongly age dependent, comorbidity related, men more at risk than women - but even within all these categories, large variation.
Understanding the disease is not just about stratification; it also gives us pointers about which drugs to test (ideally drugs that have already been developed for another disease) or to make (that's a long haul).
So - measuring, in a variety of ways, the humans who get severe disease (with the contrast being those who don't, or who have mild disease) is going to be important. And boy can we measure humans - their genomes, DNA methylation in their cells, RNA, metabolites, proteins and imaging.
But... this plethora of measurement comes with all sorts of headaches in analysis - many measurements in no way guarantee success, and furthermore there are going to be a host of potentially misleading results with apparently excitingly low p-values.
This thread is to help people structure their studies well and to immunise readers against shoddy results, of which I can already sense many coming <sigh>.
First piece of advice - get the team that does this right. At a minimum you need someone from the clinical side with a good clinical-logistics brain - otherwise you can't get the numbers and quality of data - *and* you need someone from the stats/genomics side with analysis experience.
Adding in someone with an infection-biology perspective, ideally viral infection, is also necessary - someone with working background knowledge of what happens when, what to measure and when, and what the broad stages of infection are. Bonus points for bringing in a card-carrying epidemiologist.
There are very few people who have all these skills. Don't pretend if you don't have them - you can't upskill quickly, in particular because many of these skills come from having screwed something up yourself: people rarely blog about the big design/experiment/analysis mistakes they've made :)
Ideally all these people will be battle-scarred - they will have a number of studies under their belts from clinical work in the field. They will have lived through annoying artefacts, batch effects, cryptic stratification and other gotchas.
Now, turning to "meta-design" if you like. Unless you are in a consortium, you are unlikely, on your own and locally, to know whether your study will generalise even if you did get a good result. So: (a) build in a resampling strategy, and (b) buddy up with another site.
Ideally buddy up with a site that is clearly not linked to yours - not two hospitals in the same town sending samples to one place; you want two hospitals, in different countries, running broadly the same protocol.
(don't worry about credit. If you truly find a key aspect of COVID19 you will dine out on it for a long time along with whoever you buddy up with. Plenty of credit to go around for the group of people who make headway on this disease)
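Back to point (a) above: a resampling strategy can be as simple as a held-out split you do not touch until the end. A minimal sketch in R, where `mat` (a samples-by-features matrix) is a purely hypothetical name:

```r
set.seed(42)
n <- nrow(mat)
discovery_idx <- sample(seq_len(n), size = floor(n / 2))

## Run the full analysis only on the discovery half ...
discovery  <- mat[discovery_idx, ]
## ... and keep the other half untouched until you have a fixed list of hits to re-check
validation <- mat[-discovery_idx, ]
```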
Get ethics started early, talk to your ethics head informally, and get that straight at the start. It should, in this situation, be reasonably straightforward but don't add it in at the end.
Next: sample contrast. Remember that the biggest source of variance in whether people get the disease is whether they were exposed to the virus. This is ... huge ... and will stratify everything and create all sorts of weird biases. Therefore "normal" vs "disease" is not going to be a great contrast.
Mild vs severe in the same hospital is far better, but remember to take any biological sample *before* determining mild vs severe. Otherwise you will just have a self-confirmatory loop of analysis. Germ-line DNA (which does not change) is the *only* thing you can sample late.
My rule-of-thumb sample sizes. For biomarker discovery (methylation, RNA, metabolites, CyTOF) you want *at least* 50 samples in each category. The heterogeneity in people and in disease trajectory is going to be a nightmare, and you need enough samples to get some handle on this.
For genetics, because the vast majority of alleles are at <1% frequency, you need at least 1,000 cases and 1,000 controls - but probably far more due to this additional variance. Sadly I don't think we're going to be powered until we're at the 5,000 scale. (sorry).
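A rough way to see why 1,000 vs 1,000 struggles for low-frequency alleles is to treat it, approximately, as a two-proportion comparison at genome-wide significance. A sketch using base R's power.prop.test; the carrier frequencies and the 5e-8 threshold are illustrative assumptions, not numbers from this thread:

```r
## Power to detect a carrier-frequency difference (1% in controls vs 2% in cases)
## at genome-wide significance with 1,000 per arm - expect this to come out very low
power.prop.test(n = 1000, p1 = 0.01, p2 = 0.02, sig.level = 5e-8)

## Solve instead for the n per arm needed to reach 80% power at the same effect size
power.prop.test(p1 = 0.01, p2 = 0.02, sig.level = 5e-8, power = 0.8)
```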
Don't think measuring more somehow improves your sample size. This is where the clinical part of the team, with their clinical-logistics brain, is going to be critical. Remember also that to get 50 samples you might have to recruit 100.
Obviously run good assays. Don't get clever on the assay - this is not the time for method development. Do the things your genomicist/multi-omics pro knows how to do. QC well, and don't be afraid to drop samples that look weird (do this before the analysis starts in earnest) - but do keep a note of the weird samples and check that nothing pops out of them (sometimes it is the weird samples that give insight, via a very different route from the experiment you planned. Biology is a complex thing).
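A minimal sketch of the "drop but record" idea, again with the hypothetical samples-by-features matrix `mat` and purely illustrative thresholds:

```r
## Simple per-sample QC: flag high missingness or extreme total signal
miss_rate <- rowMeans(is.na(mat))
total_sig <- rowSums(mat, na.rm = TRUE)

flag <- miss_rate > 0.10 | as.vector(abs(scale(total_sig))) > 4

weird_samples <- rownames(mat)[flag]   # keep this list and revisit it later
mat_clean     <- mat[!flag, ]          # analysis proceeds on the cleaned matrix
```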
You want an analyst who has worked with the datatype before. As well as the "standard" batch effects, remember that the rest of the covariate modelling is going to be complex - from the obvious (age) to the near impossible (social contacts).
(this goes to why mild vs severe disease is going to be cleaner than disease vs absent. This is also where your epidemiologist team member will be great, and they will spend some time talking about "collider bias". Listen to them)
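Collider bias is easy to demonstrate with a toy simulation: two risk factors that are independent in the population become correlated once you condition on being in the study (for example, on being hospitalised or tested). A sketch with effect sizes invented purely for illustration:

```r
set.seed(1)
n <- 1e5
genetic_risk <- rnorm(n)   # independent of comorbidity in the population
comorbidity  <- rnorm(n)

## Both factors raise the chance of ending up in the study - the "collider"
in_study <- rbinom(n, 1, plogis(-2 + genetic_risk + comorbidity)) == 1

cor(genetic_risk, comorbidity)                       # ~ 0 in the full population
cor(genetic_risk[in_study], comorbidity[in_study])   # negative: induced by selection
```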
The process of deciding on and modelling the covariates is complex, and it's a key part of the analysis. Attack it both from first principles and via data science (good ol' PCA, latent factors ...).
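A minimal sketch of the data-science side, assuming the hypothetical `mat_clean` matrix from above plus measured covariates (`severity`, `age`, `sex`, `batch`, all hypothetical names): top principal components stand in for latent structure (batch, ancestry, cell composition, ...) alongside the first-principles covariates:

```r
## Data-driven covariates: top PCs of the assay matrix
pca <- prcomp(mat_clean, center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:5]

## One feature at a time, model the outcome with both known and latent covariates
dat <- data.frame(y = mat_clean[, 1], severity, age, sex, batch, pcs)
fit <- lm(y ~ severity + age + sex + batch + PC1 + PC2 + PC3 + PC4 + PC5, data = dat)
summary(fit)
```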
This will be complex, which is why a follow-up study where you *just* aim to replicate the findings you made in round 1 is a good idea. I don't know anyone who can work out the best - or even good-enough - covariates without data, but then you need a second, independent set of data. (sorry).
In the analysis, as well as all this covariate work at the start, you are likely to have just-on-the-edge-of-power datasets. You will need to be paranoid about multiple testing. Obviously there will be good use of FDR if you / your analyst is good (p.adjust to the rescue).
Even then, be more paranoid. Look for FDRs in the 0.01 range, not 0.1 - and *please* don't publish things with higher FDRs; they almost certainly won't be true. I give myself a 10-fold higher bar when datasets are super heterogeneous.
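The p.adjust step mentioned above, as a sketch - assuming `pvals` is a hypothetical vector holding one p-value per feature from whatever per-feature model was fit:

```r
fdr  <- p.adjust(pvals, method = "BH")   # Benjamini-Hochberg FDR
hits <- which(fdr < 0.01)                # the stringent cut argued for here, not 0.1
length(hits)
```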
(Heterogeneity is annoying, but it also means there are cryptic rare events hiding where you don't yet know to look.)
Once done, remember the great Feynman quote "The first principle is that you must not fool yourself – and you are the easiest person to fool."
All this done? Got a result at < 0.01 FDR? Want to publish? Ideally publish together with your buddy - already showing it generalises. If not, point to your buddy's forthcoming publication (but honestly, you will be better off with one paper, or with back-to-back papers).
Don't feel the need to overclaim - if you are right, the first paper seeding the idea will be remembered. Say it needs replication and triangulation by other methods. Point out where you are worried about sample heterogeneity and how to replicate it.
And... be careful out there. We need some sort of sensible signal to noise on the science of this disease. <sermon ends>