Mark Ziemann🌈🌻
Feb 17, 2023 · 24 tweets · 7 min read
I want to talk today about a methodological issue in #genomics research that has been around a long time but is still a major problem.
The reason is that today I reviewed another manuscript that has this exact problem.
First, some background.
In genomics research we often profile how genes are switched on and off in disease and development, and in these profiling experiments we identify dozens to thousands of genes that could play a role in those processes.
Gene names don’t tell us about their function. We could dig into the literature on them, but the lists are so big it takes too long. So we often use tools to summarise whether genes belonging to certain functional groups are over-represented.
This is called “enrichment analysis”, and it is one of the most used techniques in computational biology and #Bioinformatics. In 2022 there were >22k PubMed papers with “enrichment analysis” in the abstract alone!
We have databases of functional classifications and software that does the statistical analysis. Long ago, the only way to do this type of analysis was with statistical computer languages like R.
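To make the idea concrete, here is a minimal sketch of the classic over-representation test in R for a single functional category, using a 2×2 table and Fisher's exact test. All counts below are invented for illustration.

# Over-representation of one functional category (invented numbers).
# Table rows: in category / not in category.
# Table columns: in my gene list / in the rest of the background.
n_list        <- 500    # differentially expressed genes
n_list_in_cat <- 60     # of those, members of category X
n_background  <- 15000  # all genes considered (the background)
n_bg_in_cat   <- 900    # category X genes within the background

tab <- matrix(c(n_list_in_cat,
                n_list - n_list_in_cat,
                n_bg_in_cat - n_list_in_cat,
                (n_background - n_list) - (n_bg_in_cat - n_list_in_cat)),
              nrow = 2)

fisher.test(tab, alternative = "greater")  # one-sided test for enrichment

Dedicated enrichment packages essentially repeat this kind of hypergeometric/Fisher test across thousands of gene sets and then correct the p-values for multiple testing.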
Now, there are so many different tools for enrichment analysis, implemented in different computer languages, some using the command line, some with a graphical interface and now lots of different web-based tools.
This is great as it makes these tools more widely available. They are so easy to use, you just paste in the list of genes from your profiling data and *BOOM* in 5 seconds you get a list of functional categories that you can paste into your next manuscript or report.
But there is an important and subtle problem with all this, which I’ll unpack in an example. Let’s say you are interested in a candidate anticancer drug. You’re excited because the drug appears to halt cancer cell growth without affecting normal cells.
So you profile the gene expression with RNA-seq on cancer cells with and without the drug. You do some statistical analysis and get lists of up- and down-regulated genes. You punch these into your favourite enrichment tool and it gives confusing results.
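(A rough sketch of how those up/down lists are typically produced, assuming a DESeq2-style workflow with a raw count matrix count_matrix and a sample table sample_info with a condition column — illustrative, not the pipeline of any particular study.)

library(DESeq2)

# count_matrix: genes x samples matrix of raw RNA-seq counts (assumed to exist)
# sample_info:  data frame, one row per sample, with a 'condition' factor (drug vs control)
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData   = sample_info,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds)

# Up- and down-regulated gene lists at FDR < 0.05
up   <- rownames(res)[which(res$padj < 0.05 & res$log2FoldChange > 0)]
down <- rownames(res)[which(res$padj < 0.05 & res$log2FoldChange < 0)]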
The enrichment analysis says cell proliferation is over-represented in both the up and down lists! How can that be?

It comes down to how the background for the gene list was defined!
Wait, what is a background?

It is the set of genes that were detected in the profiling assay. Remember, the genes of interest can only appear differentially regulated if they have some baseline expression in the cells/tissues you’re working with.
This is actually pretty important because the genome has 45,794 genes, and most of them are switched off in any one tissue or cell.
A cancer cell might express only 15,000 genes, and that set of “background” or “baseline” genes is going to be wildly different from other tissues and cell types, like skeletal muscle or brain.
When calculating over-representation, we need to ask “over-representation compared to what??”
If we use the list of 45,794 genes in the genome, then we’re assuming all of them have equal chance of entering the differentially expressed lists. That’s complete rubbish!
The background needs to be set properly: use the list of 15,000 detected genes as the background and the problem will be fixed! Then enrichment analysis will start making sense with the phenotypic observations, and be overall more reliable and informative.
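Here is a toy calculation (invented numbers) showing how much the choice of background can change the same hypergeometric over-representation test in R:

# Same gene list and category, two different backgrounds (numbers invented).
# phyper(q, m, n, k, lower.tail = FALSE) gives P(X > q), so use overlap - 1
# to get P(X >= overlap).
overlap   <- 40     # list genes annotated to "cell proliferation"
list_size <- 400    # differentially expressed genes

# Background 1: the whole genome
genome_size   <- 45794
cat_in_genome <- 1200
p_genome <- phyper(overlap - 1, cat_in_genome, genome_size - cat_in_genome,
                   list_size, lower.tail = FALSE)

# Background 2: only the genes detected in this experiment
detected_size   <- 15000
cat_in_detected <- 1000   # most category genes happen to be expressed here
p_detected <- phyper(overlap - 1, cat_in_detected, detected_size - cat_in_detected,
                     list_size, lower.tail = FALSE)

c(whole_genome = p_genome, detected_background = p_detected)
# The whole-genome background makes the category look far more significant
# than it is relative to the genes that could actually have been detected.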
If you use enrichment analysis in your work, and have not been made aware of this before, stop and read this 2016 piece from @metapredict
genomebiology.biomedcentral.com/articles/10.11…
So how bad are the consequences of this mistake?
In a recent paper we showed that this mistake can lead to 56% of enrichment results being false!
See Fig 4D
journals.plos.org/ploscompbiol/a…
How common is this problem? We did a screen of 197 articles and found only 8 that gave information on whether a background list was used. That's ~4%!!!
Many enrichment tools have no ability to even accept a custom background gene set! You can only use the whole genome as the background. Here I’m talking about maayanlab.cloud/Enrichr/ and geneontology.org. These are tools with 1000s of citations per year.
Given that the primary use for enrichment analysis is the interpretation of gene expression data (see PMID: 35263338 supp Fig 1B), we can confidently say that 90% of the time those tools are used, they generate FALSE results.
Any enrichment tool that doesn’t accept a background list is invalid and should never be presented as evidence of anything. Developers of these tools need to add a background as a MANDATORY part of the analysis, and force the end-users to think about why it might impact results.
End users can act now by switching over to tools that do take background lists. david.ncifcrf.gov and bioinformatics.sdstate.edu/go/ are two good options.
Use the background list and keep it for future reference, as the work isn’t reproducible without it.
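One way to do that, sketched below assuming the DESeq2-style dds object from earlier: take the genes that passed the same detection filter used for the differential expression analysis, and write them to a plain-text file alongside the results.

# The background is the set of genes that were actually tested, e.g. those
# passing the same minimal count filter used before differential expression.
keep       <- rowSums(counts(dds)) >= 10
background <- rownames(dds)[keep]

# Save as plain text so it can be supplied to enrichment tools and archived.
writeLines(background, "background_genes.txt")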
When storing gene lists, do so carefully as spreadsheets are known to autocorrect gene names into dates. genomebiology.biomedcentral.com/articles/10.11…
