1/9 Every bulk RNA-seq experiment I run goes through the same 7 checks before I trust the results.
I've been burned enough times to know: if you skip QC, you will find out the hard way. Usually during a meeting with your collaborator.
Here's my checklist:
2/9 Check 1: FastQC + MultiQC on raw reads.
Before anything else. You're looking for adapter contamination, GC bias, per-base quality drops, and overrepresented sequences.
I've caught entire lanes of garbage data at this step. Five minutes that saves you days.
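If you want to script this step, FastQC writes a `summary.txt` per sample (`STATUS<tab>Module<tab>Filename`) that's easy to scan for WARN/FAIL modules. A minimal sketch; the summary text below is invented for illustration:

```python
# Flag any FastQC module that isn't PASS.
# SAMPLE_SUMMARY mimics FastQC's summary.txt format; it is made-up data.

SAMPLE_SUMMARY = """\
PASS\tBasic Statistics\tsample1.fastq.gz
WARN\tPer base sequence content\tsample1.fastq.gz
FAIL\tOverrepresented sequences\tsample1.fastq.gz
PASS\tAdapter Content\tsample1.fastq.gz
"""

def flag_fastqc(summary_text):
    """Return (status, module) pairs for every non-PASS module."""
    flags = []
    for line in summary_text.strip().splitlines():
        status, module, _filename = line.split("\t")
        if status != "PASS":
            flags.append((status, module))
    return flags

flags = flag_fastqc(SAMPLE_SUMMARY)
```

In practice you'd glob over every sample's `summary.txt` and print a one-line report per sample.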
3/9 Check 2: Mapping rate.
After alignment (STAR, HISAT2, whatever you use), check the percentage of uniquely mapped reads. I want to see >70% for most human/mouse experiments.
Low mapping rate? Could be contamination, wrong reference genome, or your library prep went sideways.
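This one is easy to automate: STAR's `Log.final.out` uses a `key | value` layout you can grep. A sketch with a fabricated log excerpt:

```python
# Pull "Uniquely mapped reads %" out of STAR's Log.final.out and
# flag anything under 70%. STAR_LOG is a fabricated excerpt.

STAR_LOG = """\
                   Uniquely mapped reads number |\t18234567
                        Uniquely mapped reads % |\t62.30%
"""

def unique_mapping_pct(log_text):
    for line in log_text.splitlines():
        if "Uniquely mapped reads %" in line:
            return float(line.split("|")[1].strip().rstrip("%"))
    raise ValueError("mapping rate not found in log")

pct = unique_mapping_pct(STAR_LOG)
ok = pct >= 70.0  # my threshold for human/mouse; adjust per organism
```

MultiQC aggregates the same numbers across samples, but a five-line parser is handy inside a pipeline rule.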
4/9 Check 3: Gene body coverage.
Run RSeQC's geneBody_coverage.py. You want a roughly even signal across the gene body for poly-A enriched libraries.
Heavy 3' bias? Degraded RNA. Heavy 5' bias? Rare, but possible fragmentation issue. Either way, you need to know before you start calling DE genes.
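geneBody_coverage.py gives you the plot, but you can also reduce its coverage curve (coverage in 100 percentile bins along the gene body) to a single bias number. A sketch with invented coverage values showing a 3'-biased library:

```python
# Quantify 3' bias from a gene-body coverage curve: compare mean
# coverage over the first and last 20 percentile bins.
# `cov` is a toy curve, not real RSeQC output.

cov = [10] * 20 + [50] * 60 + [90] * 20  # low at 5', high at 3'

five_prime = sum(cov[:20]) / 20
three_prime = sum(cov[-20:]) / 20
ratio = three_prime / five_prime  # >> 1 suggests 3' bias (degraded RNA)
```

A ratio near 1 is what you want for a poly-A library; I'd start asking questions somewhere above ~2, though the exact cutoff is a judgment call.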
5/9 Check 4: PCA plot. The most important plot in the entire analysis.
Do your samples cluster by biology (treatment, condition) or by batch, lane, or extraction date?
If your PC1 is "which day the RNA was extracted," you have a batch effect problem. Fix it now or your DE results are noise.
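You don't need sklearn for this; SVD on centered log counts is PCA. A toy sketch where I've fabricated a matrix in which batch, not treatment, drives PC1:

```python
import numpy as np

# PCA via SVD on log-transformed counts (rows = samples, cols = genes).
# The matrix is invented: "batch B" samples shift together on every
# gene, regardless of treatment, so PC1 separates batches.

log_counts = np.array([
    [5.0,  7.0, 3.0, 6.0],   # ctrl,  batch A
    [5.1,  7.1, 3.1, 6.1],   # treat, batch A
    [8.0, 10.0, 6.0, 9.0],   # ctrl,  batch B
    [8.1, 10.1, 6.1, 9.1],   # treat, batch B
])

centered = log_counts - log_counts.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = U[:, 0] * S[0]  # PC1 score per sample
```

Here `pc1` puts the two batch-A samples on one side and the two batch-B samples on the other; color your real PCA by every metadata column you have and look for exactly this pattern.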
6/9 Check 5: Sample-to-sample correlation heatmap.
Complements the PCA. Hierarchical clustering should group replicates together.
If one replicate clusters with the wrong group, you either have a sample swap or an outlier. I've caught mislabeled samples this way more than once.
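A quick numeric version of the same check: every sample's nearest neighbor (by Pearson correlation) should be its own replicate. Toy data below, fabricated so that "treat_2" looks like a control, i.e. a plausible swap:

```python
import numpy as np

# Sample-to-sample Pearson correlation on log counts; each sample's
# best match (excluding itself) should be a replicate from its own
# group. Counts are invented; treat_2 mimics the controls.

samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
log_counts = np.array([
    [2.0, 8.0, 5.0, 1.0],  # ctrl_1
    [2.1, 7.9, 5.2, 1.1],  # ctrl_2
    [9.0, 1.0, 2.0, 7.0],  # treat_1
    [2.0, 8.1, 5.1, 1.0],  # treat_2 -- correlates with the controls
])

corr = np.corrcoef(log_counts)
# nearest neighbor per sample, skipping self (index 0 after sorting)
nearest = [samples[int(np.argsort(-corr[i])[1])] for i in range(len(samples))]
```

If `nearest` for any sample points at the wrong group, stop and chase down the metadata before touching DE.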
7/9 Check 6: Library complexity / duplication rate.
Picard's MarkDuplicates, or the sequence duplication plot FastQC already gave you in Check 1. High duplication (>60%) means you probably sequenced too little input material.
Your "20 million reads" might actually be 5 million unique reads. That changes everything for statistical power.
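The arithmetic is trivial but worth writing down, because the number that matters for power is unique reads, not sequenced reads. Sketch with made-up numbers (note Picard's PERCENT_DUPLICATION is a fraction, not a percent):

```python
# Effective unique depth from a duplication rate.
# Numbers are invented to match the example above.

total_reads = 20_000_000
percent_duplication = 0.75  # Picard reports this as a fraction

unique_reads = int(total_reads * (1 - percent_duplication))
# 20M sequenced reads at 75% duplication -> 5M unique reads
```

Run this per sample and you'll know immediately whether a library needs to be topped up before you bother analyzing it.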
8/9 Check 7: Count distribution and filtering.
After quantification, look at the distribution of counts per gene. Filter low-count genes (I typically require >10 counts in at least n samples where n = your smallest group size).
Also check for genes driving >5% of total counts. One mitochondrial gene eating half your library is more common than you think.
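Both of these checks are a few lines on the count matrix. A sketch of the exact rules above, with a toy matrix (genes as rows) that includes one dominant gene:

```python
import numpy as np

# Filtering rule: keep genes with >10 counts in at least n samples,
# where n = smallest group size. Also flag genes taking >5% of all
# counts. The count matrix is invented.

counts = np.array([
    [0,    1,    0,    2],     # lowly expressed -> dropped
    [15,   20,   0,    30],    # >10 in 3 samples -> kept
    [5000, 6000, 5500, 5800],  # dominant gene (e.g. mito) -> flagged
    [12,   14,   13,   15],    # kept
])
n = 2  # smallest group size in this toy design

keep = (counts > 10).sum(axis=1) >= n
share = counts.sum(axis=1) / counts.sum()
dominant = share > 0.05
```

If `dominant` flags anything, look at what the gene is before filtering it out; a mito or globin gene eating the library is a wet-lab signal, not just a statistics problem.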
9/9 I run these 7 checks on every single dataset. No exceptions.
It takes about 30 minutes for a typical experiment. I've written Snakemake pipelines that automate most of it.
The alternative is spending two weeks on a differential expression analysis, presenting results, and having someone ask "did you check for batch effects?" while you stare at the floor.
Ask me how I know.
I hope you've found this post helpful.
Follow me for more.
Subscribe to my FREE newsletter chatomics to learn bioinformatics divingintogeneticsandgenomics.ck.page/profile x.com/433559451/stat…
