Michael Love Profile picture
Apr 11 17 tweets 8 min read
Got an RNA-seq dataset with 50, 100, 200+ samples? Plug it into a differential expression tool and hope for the best? No! You need to consider QC, EDA, and modeling technical variation, or else risk generating spurious results. A thread on papers, methods, and best practices:
Short version:
1) look for outliers (QC) and technical variation with PCA plots
2) consider problems with confounding: model unwanted variation with methods like RUV / SVA / PEER
3) include technical factors in linear model, iterate with respect to positive and negative controls
This is commonly agreed upon. All of the main workflows for Bioconductor DE tools stress quality control and examination of EDA plots such as PCA before any statistical testing, see e.g.

f1000research.com/articles/4-1070
f1000research.com/articles/5-1408
f1000research.com/articles/5-1438
Of course it's not just RNA-seq but all omics data, where changes in the data generation process may affect the high throughput measurements, leading to modeling problems if left unadjusted

ncbi.nlm.nih.gov/pmc/articles/P…
It's really easy to generate a quick PCA plot for QC and EDA. For example, in DESeq2, once you have a dataset object, it's just 2 lines of code:

dds %>% vst %>% plotPCA
After removal of outliers, continue to examine any large scale structure that may affect estimation of effects of interest. This is basic linear regression: estimates can be biased and inference invalid unless confounding var's are included in the model (more on estimands later)
Let's pause here: note that nonparametric models won't save you from this problem. They may appear to save you temporarily due to lower power, but as n -> infty they will also reject the null due to shifts in the distribution from confounding, not biological effects of interest
Nonparametric models do provide robustness to model mis-specification. After seeing SAMseq do well in one of our F1000R evaluations in 2018, we've used NP methods in 3 of our recent method publications.

It's just that this benefit doesn't address technical variation.
In our lab we use the Bioconductor methods RUV and SVA for sequence count data, from @drisso1893 and @jtleek, whenever we have large n. We have both methods demonstrated in our workflow. For more details, see these methods papers:

ncbi.nlm.nih.gov/pmc/articles/P…
ncbi.nlm.nih.gov/pmc/articles/P…
They add only a few lines of code to your analysis, e.g. here is how we estimate RUV factors in our scripts. These factors are then included in the full and reduced models during testing.
What choice of K, the dimension of the technical variation? We've used iterative analysis, as suggested by the RUV method authors. We've described our use of RUV on large expression datasets here, in this work from @bhattac_a_bt and @AlinaHamilton

ncbi.nlm.nih.gov/pmc/articles/P…
Another interesting recent idea to estimating the dimension of technical variation was in this recent paper from @xzlab_org lab, where they propose to triangulate across DE and MR signal to optimize K, aiming for defining positive controls empirically

ncbi.nlm.nih.gov/pmc/articles/P…
On estimands: another important question is if confounding variables interact with treatment or exposure of interest, also rethinking what is the desired effect to be estimated. @sarah_reifeis described applying causal inference methods to DE analysis:

ncbi.nlm.nih.gov/pmc/articles/P…
You can also approach the problem of causal effects with confounding via matching, as in this paper from Heller et al:

pubmed.ncbi.nlm.nih.gov/19098026/

Or targeted estimators (TMLE), in this paper from @nshejazi et al

semanticscholar.org/paper/Variance…
On computational speed on large datasets, want to mention the great work of @const_ae and @wolfgangkhuber to speed up NB GLM in DESeq2 often by an order of magnitude. Easily called from within DESeq2:

ncbi.nlm.nih.gov/pmc/articles/P…
I've put together a short Rmarkdown document demonstrating basic QC and EDA steps applied to a large RNA-seq dataset (PCA plot, removal of outliers).

Also how to model technical variation using RUV and incorporate those into edgeR or DESeq2 analysis:

github.com/mikelove/preNi…
Thx to @amarcobio for conversation about issues modeling large datasets 🙏

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Michael Love

Michael Love Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @mikelove

Jul 7, 2020
New preprint from first author Scott Van Buren, we look at various aspects of quantification uncertainty for scRNA-seq counts: interval coverage, trajectory analysis, and DE testing. 1/7

biorxiv.org/content/10.110…
Last year, in the alevin publication, @k3yavi et al showed that assignment of all the reads in scRNA-seq was critical for accurate estimation of abundance across categories of genes by uniqueness. 2/7

genomebiology.biomedcentral.com/articles/10.11…
And in the Swish publication, @anqiz91 et al showed how bootstrap replicates from alevin could be incorporated into a SAMseq procedure for differential testing across groups of cells. 3/7

academic.oup.com/nar/article/47…
Read 7 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us on Twitter!

:(