Ming
Mar 15, 15 tweets, 3 min read
1/9 Every bulk RNA-seq experiment I run goes through the same 7 checks before I trust the results.

I've been burned enough times to know: if you skip QC, you will find out the hard way. Usually during a meeting with your collaborator.

Here's my checklist:
2/9 Check 1: FastQC + MultiQC on raw reads.

Before anything else. You're looking for adapter contamination, GC bias, per-base quality drops, and overrepresented sequences.

I've caught entire lanes of garbage data at this step. Five minutes that saves you days.
3/9 Check 2: Mapping rate.

After alignment (STAR, HISAT2, whatever you use), check the percentage of uniquely mapped reads. I want to see >70% for most human/mouse experiments.

Low mapping rate? Could be contamination, wrong reference genome, or your library prep went sideways.
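
A minimal sketch of this check, assuming you aligned with STAR and have the text of each sample's Log.final.out in hand. The function names and the example excerpts are made up for illustration; the 70% cut-off is the one from the tweet above.

```python
import re


def uniquely_mapped_pct(log_text: str) -> float:
    """Pull the 'Uniquely mapped reads %' value out of STAR Log.final.out text."""
    match = re.search(r"Uniquely mapped reads %\s*\|\s*([\d.]+)%", log_text)
    if match is None:
        raise ValueError("no 'Uniquely mapped reads %' line found")
    return float(match.group(1))


def flag_low_mapping(logs: dict[str, str], threshold: float = 70.0) -> list[str]:
    """Return the names of samples whose unique mapping rate is below the threshold."""
    return sorted(
        name for name, text in logs.items() if uniquely_mapped_pct(text) < threshold
    )


# Made-up excerpts laid out the way STAR writes this line.
example_logs = {
    "sample_A": "        Uniquely mapped reads % |\t91.20%\n",
    "sample_B": "        Uniquely mapped reads % |\t54.75%\n",
}

print(flag_low_mapping(example_logs))
```

Anything this flags is a prompt to go back and look at contamination, the reference, or the library prep, not an automatic reason to drop the sample.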
4/9 Check 3: Gene body coverage.

Run RSeQC's geneBody_coverage.py. You want a roughly even signal across the gene body for poly-A enriched libraries.
Heavy 3' bias? Degraded RNA. Heavy 5' bias? Rare, but possibly a fragmentation issue. Either way, you need to know before you start calling DE genes.
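
One way to put a number on the bias, as a rough sketch: take a coverage profile ordered 5' to 3' (for example, percentile bins of the gene body) and compare the two ends. The ratio and the 20% edge width here are my own illustrative choices, not something RSeQC reports.

```python
from statistics import mean


def three_to_five_ratio(coverage: list[float], edge_fraction: float = 0.2) -> float:
    """Mean coverage at the 3' end divided by mean coverage at the 5' end.

    `coverage` must be ordered 5' -> 3'. A value near 1 means even coverage;
    a value well above 1 means 3' bias.
    """
    n_edge = max(1, int(len(coverage) * edge_fraction))
    five_prime = mean(coverage[:n_edge])
    three_prime = mean(coverage[-n_edge:])
    if five_prime == 0:
        return float("inf")
    return three_prime / five_prime


# Made-up profiles with 100 bins each.
even_profile = [100.0] * 100
degraded_profile = [10.0 + i for i in range(100)]  # coverage climbs toward the 3' end

print(three_to_five_ratio(even_profile))
print(three_to_five_ratio(degraded_profile))
```

A single number will not replace looking at the curve, but it makes it easy to sort samples and spot the one that stands apart from the rest.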
5/9 Check 4: PCA plot. The most important plot in the entire analysis.

Do your samples cluster by the biology (treatment, condition) or by batch, lane, or extraction date?
If your PC1 is "which day the RNA was extracted," you have a batch effect problem. Fix it now or your DE results are noise.
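
To make the idea concrete, here is a dependency-free sketch that computes PC1 scores from log2(count + 1) values using power iteration on the sample-by-sample Gram matrix. In practice you would normally use the PCA from your DE package or a numerical library on properly normalized counts; the sample names and counts below are invented.

```python
import math
import random


def pc1_scores(counts: dict[str, list[float]]) -> dict[str, float]:
    """PC1 score per sample. `counts` maps sample -> counts for the same genes, same order."""
    names = list(counts)
    logged = [[math.log2(c + 1) for c in counts[name]] for name in names]
    n_samples, n_genes = len(names), len(logged[0])
    # Center each gene across samples.
    gene_means = [sum(row[g] for row in logged) / n_samples for g in range(n_genes)]
    centered = [[row[g] - gene_means[g] for g in range(n_genes)] for row in logged]
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    rng = random.Random(0)  # fixed seed so the start vector is reproducible
    vec = [rng.uniform(-1.0, 1.0) for _ in names]
    for _ in range(200):  # power iteration
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram)
    )
    scale = math.sqrt(max(eigenvalue, 0.0))
    return {name: v * scale for name, v in zip(names, vec)}


# Invented data: two conditions, two extraction days, with a strong day effect.
example_counts = {
    "ctrl_day1": [100, 200, 50, 1000, 30],
    "trt_day1": [110, 190, 55, 1100, 60],
    "ctrl_day2": [400, 50, 200, 300, 32],
    "trt_day2": [420, 55, 190, 310, 65],
}

for sample, score in pc1_scores(example_counts).items():
    print(f"{sample}\t{score:+.2f}")
```

In this made-up example the day 1 samples land on one side of PC1 and the day 2 samples on the other, regardless of treatment. That is the batch-effect pattern to look for. The sign of a principal component is arbitrary, so only the grouping matters.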
6/9 Check 5: Sample-to-sample correlation heatmap.

Complements the PCA. Hierarchical clustering should group replicates together.

If one replicate clusters with the wrong group, you either have a sample swap or an outlier. I've caught mislabeled samples this way more than once.
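
A simplified stand-in for reading the heatmap by eye: for each sample, find the other sample it correlates with best and check whether the two share a group label. This is a nearest-neighbor check, not hierarchical clustering, and the data and names are invented for illustration.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def suspicious_samples(
    counts: dict[str, list[float]], groups: dict[str, str]
) -> list[str]:
    """Samples whose best-correlated partner carries a different group label."""
    logged = {s: [math.log2(c + 1) for c in v] for s, v in counts.items()}
    flagged = []
    for sample, profile in logged.items():
        best = max(
            (other for other in logged if other != sample),
            key=lambda other: pearson(profile, logged[other]),
        )
        if groups[best] != groups[sample]:
            flagged.append(sample)
    return flagged


# Invented counts: "ctrl_3" is labeled control but looks like a treated sample.
example_counts = {
    "ctrl_1": [10, 200, 50, 800, 30, 400],
    "ctrl_2": [12, 210, 45, 780, 33, 390],
    "ctrl_3": [250, 30, 400, 90, 600, 40],
    "trt_1": [300, 20, 500, 60, 700, 25],
    "trt_2": [310, 22, 480, 65, 690, 27],
}
example_groups = {
    "ctrl_1": "control", "ctrl_2": "control", "ctrl_3": "control",
    "trt_1": "treated", "trt_2": "treated",
}

print(suspicious_samples(example_counts, example_groups))
```

A flagged sample is a candidate swap or outlier. Check the sample sheet and the wet-lab notes before relabeling or dropping anything.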
7/9 Check 6: Library complexity / duplication rate.

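Run Picard's MarkDuplicates, or start from the duplication levels FastQC already reported in Check 1. High duplication (>60%) usually means low library complexity: too little input material or too many PCR cycles. Highly expressed genes also produce genuine duplicates in RNA-seq, so read the number in context.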
Your "20 million reads" might actually be 5 million unique reads. That changes everything for statistical power.
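
The arithmetic behind that, as a tiny sketch: going from 20 million reads to 5 million unique reads corresponds to a 75% duplication rate. The function name is made up.

```python
def unique_reads(total_reads: int, duplication_rate: float) -> int:
    """Reads left after removing duplicates, given a duplication rate between 0 and 1."""
    if not 0.0 <= duplication_rate <= 1.0:
        raise ValueError("duplication_rate must be between 0 and 1")
    return round(total_reads * (1.0 - duplication_rate))


print(unique_reads(20_000_000, 0.75))  # 5000000
```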
8/9 Check 7: Count distribution and filtering.

After quantification, look at the distribution of counts per gene. Filter low-count genes (I typically require >10 counts in at least n samples where n = your smallest group size).
Also check for genes driving >5% of total counts. One mitochondrial gene eating half your library is more common than you think.
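
Both rules are easy to sketch on a plain gene-by-sample count table. The assumptions here are mine: counts are raw integers, "driving >5% of total counts" is checked per sample, and the gene names and numbers are invented.

```python
def keep_gene(counts: list[int], min_count: int = 10, min_samples: int = 3) -> bool:
    """True if the gene has more than `min_count` reads in at least `min_samples` samples."""
    return sum(c > min_count for c in counts) >= min_samples


def dominant_genes(matrix: dict[str, list[int]], max_fraction: float = 0.05) -> list[str]:
    """Genes that take more than `max_fraction` of any single sample's total counts."""
    n_samples = len(next(iter(matrix.values())))
    totals = [sum(row[i] for row in matrix.values()) for i in range(n_samples)]
    return sorted(
        gene
        for gene, row in matrix.items()
        if any(t > 0 and c / t > max_fraction for c, t in zip(row, totals))
    )


# Invented table: 30 ordinary genes, one mitochondrial gene, one barely detected gene.
example_matrix = {f"gene_{i}": [100, 100, 100, 100] for i in range(30)}
example_matrix["MT-RNR2"] = [3000, 3000, 3000, 3000]
example_matrix["low_gene"] = [0, 3, 1, 12]

# Two groups of two samples, so the smallest group size is 2.
kept = [g for g, row in example_matrix.items() if keep_gene(row, min_samples=2)]

print(len(kept), "genes kept")
print(dominant_genes(example_matrix))
```

Here the barely detected gene is filtered out, and the mitochondrial gene is flagged because it takes about half of every sample's reads.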
9/9 I run these 7 checks on every single dataset. No exceptions.

It takes about 30 minutes for a typical experiment. I've written Snakemake pipelines that automate most of it.
The alternative is spending two weeks on a differential expression analysis, presenting results, and having someone ask "did you check for batch effects?" while you stare at the floor.

Ask me how I know.
I hope you've found this post helpful.

Follow me for more.

Subscribe to my FREE newsletter, chatomics, to learn bioinformatics: divingintogeneticsandgenomics.ck.page/profile x.com/433559451/stat…
