Lior Pachter
Aug 23 · 27 tweets · 9 min read
A 🧵 on why Seurat's and Scanpy's log fold change calculations are discordant. 1/

(based on the Supplementary Notes from biorxiv.org/content/10.110…)

I first became aware of the discrepancy in LFC reporting by Seurat and Scanpy from a preprint by @jeffreypullin and @davisjmcc:

The result was surprising: computing log(x/y) = log(x) - log(y) shouldn't be complicated. 2/
So what gives?

We have molecule counts X_{ig}, where i ranges over cells & g over genes, and we consider two groups of cells G_1 and G_2 containing n_1 and n_2 cells respectively. Let's start with Seurat, which calculates LFC according to the formula below. But what is Y_{ig}? 3/ [formula image]
Y_{ig} is the logarithm of the depth-normalized counts X_{ig}, given by the formula below (where S_i is the total number of counts in cell i; the ratio X_{ig}/S_i is scaled by a factor of 10,000, and a pseudocount of 1 is added before taking the log). Why this formula? 4/ [formula image]
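A minimal NumPy sketch of this normalization (toy counts of my own, not Seurat's actual code):

```python
import numpy as np

# Toy count matrix X[i, g]: 3 cells x 4 genes (hypothetical numbers).
X = np.array([[10, 0, 5, 85],
              [20, 1, 9, 170],
              [ 5, 0, 3, 42]], dtype=float)

S = X.sum(axis=1, keepdims=True)   # S_i: total counts in cell i
Y = np.log2(10_000 * X / S + 1)    # Y_{ig} = log2(CP10k + 1)
```

Note that a zero count maps to log2(0 + 1) = 0, which is the practical reason the +1 pseudocount is there at all.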
The formula for Y_{ig} is abbreviated as log2(CP10k+1), and it's a "normalization" of the data. First, the division of X_{ig} by S_i is intended to scale counts across cells so that cell counts can be compared to each other. There's an extra factor of 10k; more on that later. 5/
The logarithm is intended to "stabilize the variance". In simple terms, the goal is to transform the counts so that genes with more counts don't have more variance across cells (a relationship that is observed in #scRNAseq). Why specifically the logarithm transform? 6/
Tl;dr: if the distribution of counts across cells is Poisson, the square root transform is optimal, and if the distribution is negative binomial, the logarithm is (a good approximation to) the optimal transform. This was worked out by Anscombe in 1948. 7/ jstor.org/stable/2332343
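A quick simulation (my own illustration, not from the thread) of why the square root stabilizes Poisson variance:

```python
import numpy as np

rng = np.random.default_rng(0)
for mean in [5, 50, 500]:
    x = rng.poisson(mean, size=200_000)
    # Raw variance tracks the mean; after sqrt it is ~1/4 regardless of the mean.
    print(mean, round(x.var(), 1), round(np.sqrt(x).var(), 3))
```

The raw variance grows linearly with the mean, while the variance after the square root stays near 1/4 for every mean, which is what "stabilized" means here.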
The mean-variance relationship in #scRNAseq looks quadratic, i.e. there seems to be a negative-binomial distribution of counts for a gene across cells. Thus the log(CP10k+1) formula. Some examples generated by @vallens below: 8/ [image: mean-variance plots]
From Anscombe's work it turns out that the 10k factor AND the +1 pseudocount in the log2(CP10k+1) formula are important. They are constants that should not be set arbitrarily, but rather according to the steepness of the quadratic relationship between mean and variance. 9/
In Seurat the defaults are fixed, probably arrived at by trial and error, and from the practical need to simply avoid plugging a 0 into the log function. But I digress... we've started to go down a variance stabilizing rabbit hole... what's the relevance to the LFC? 10/
Well the exponential removes the log2, and then the -1 gets rid of the +1 that was used in the log2(CP10k+1) formula (LHS). But now, in order to avoid plugging in a 0 into the log, another +1 pseudocount was added back in (RHS). 11/
[formula images: the Seurat LFC expression before and after simplification]
So instead of computing log(x/y), what's being computed is log((x+1)/(y+1)). If x & y are both large, the results will be similar, but not when x and/or y are small. The thing is, X_{ig}/S_i can be very small, and we see that low-expressed genes are disproportionately affected. 12/ [image]
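To see how much the +1 matters at low expression, compare log2(x/y) with log2((x+1)/(y+1)) on some hypothetical depth-normalized means:

```python
import numpy as np

def lfc(x, y):
    return np.log2(x / y)              # the intended quantity

def lfc_pseudo(x, y):
    return np.log2((x + 1) / (y + 1))  # what the +1 pseudocount produces

# Small values (typical of X_{ig}/S_i): a true 2-fold change nearly vanishes.
print(lfc(0.002, 0.001), lfc_pseudo(0.002, 0.001))  # 1.0 vs ~0.0014
# Large values: the two quantities agree much more closely.
print(lfc(20.0, 10.0), lfc_pseudo(20.0, 10.0))      # 1.0 vs ~0.93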
So in Seurat, a lack of care in setting the pseudocount in the log2(CP10k+1) formula, and probably a lack of understanding as to its meaning (via Anscombe 1948), led to a large pseudocount erroneously being applied in the LFC calculation.

How about Scanpy? 13/
Fortunately Scanpy used a tiny pseudocount to avoid this problem. In their formula, epsilon is 10^(-9). That's why, for almost all genes, Scanpy's LFC is higher than Seurat's. 14/ [formula image]
But what is Scanpy doing? It exponentiates the mean of the Y_{ig} rather than taking the mean of exp(Y_{ig}). Since log(x)+log(y) = log(xy), Scanpy is effectively computing the geometric mean of the X_{ig}/S_i + 1 values and then taking the log; Seurat uses the arithmetic mean. Which is right? 15/
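The two averaging orders can be compared directly. A sketch with hypothetical per-cell values (by AM-GM, the geometric mean is never larger than the arithmetic one):

```python
import numpy as np

vals = np.array([1.0, 1.0, 9.0])    # hypothetical per-cell normalized values

arith = vals.mean()                  # Seurat-style: average, then log
geom = np.exp(np.log(vals).mean())   # Scanpy-style: average the logs, then exp

print(arith, geom)   # ~3.667 vs ~2.080; equal only if all values match
```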
Well, the log of the geometric mean is the MLE for the mean of the logs of log-normally distributed data, so it would make sense to use the geometric mean if the exp(Y_{ig}) were log-normally distributed. Strangely, the Scanpy authors don't think this is the case! 👀 16/
In embopress.org/doi/full/10.15… they write "scRNA-seq data are not in fact log-normally distributed." 17/
There's another problem: there's a -1 after the exponentiation. That's to correct for the fact that the geometric mean has been computed on the X_{ig}/S_i + 1 values. But the geometric mean of (x_1+1, x_2+1, ..., x_n+1), minus 1, is not the same as the GM of x_1, x_2, ..., x_n unless all the x_i are equal. 18/ [formula image]
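A two-value example of the mismatch (my own numbers):

```python
import numpy as np

def gm(v):                  # geometric mean
    return np.exp(np.log(v).mean())

x = np.array([1.0, 4.0])
print(gm(x))           # GM(1, 4) = 2.0
print(gm(x + 1) - 1)   # GM(2, 5) - 1 = sqrt(10) - 1, ~2.162: not the same
```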
This mistake is less of a big deal, because the two formulas should in practice yield similar results, but it's an error and there's no reason the Y_{ig} couldn't have been exponentiated (just like in Seurat). 19/
In addition to these issues, Seurat and Scanpy are particularly discordant when one of the clusters (G_1 or G_2) has a gene with expression 0. The pseudocount differences really matter then (hence the outlier blue genes). 20/
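When one group's mean is zero, the pseudocount alone sets the answer. A sketch with a hypothetical normalized mean of 0.5 in the expressing group:

```python
import numpy as np

m = 0.5   # hypothetical normalized mean in one group; the other group is all 0

seurat_style = np.log2((m + 1) / (0 + 1))         # ~0.58
scanpy_style = np.log2((m + 1e-9) / (0 + 1e-9))   # ~28.9
print(seurat_style, scanpy_style)
```

The same gene gets an LFC of roughly 0.58 or roughly 29 depending only on the pseudocount, which is why the zero-expression genes are the extreme outliers.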
[images: Seurat vs Scanpy LFC comparison plots]
BTW, I haven't mentioned scran, but it computes the LFC for use in the t-test and binomial test, and it uses two different formulas for the two tests (!) Details in our supplement. 21/ biorxiv.org/content/10.110…
So how should this hot mess be fixed? The solution is not just to tweak the arithmetic of the LFC in Seurat and Scanpy et al. The entire workflows of these platforms need to be revisited. For example, variance stabilization doesn't even make sense. 22/
In biorxiv.org/content/10.110…, w/ @GorinGennady and @johnjvastola, we outline a modeling-based approach for rethinking what it means to analyze #scRNAseq. 23/
Right now Seurat and Scanpy are primarily geared towards arriving at a UMAP (journals.plos.org/ploscompbiol/a…). That's not a meaningful goal, and the piling on of heuristics to get there does the field no favors... 24/
... as bad as the LFC discrepancy is, the p-value discordance between Seurat and Scanpy is even worse...

Sleep tight. 😉

25/25
@NadigAjay @GorinGennady Basically to make a UMAP people first run PCA. They don't really need to do this, but they do it because it's the standard advertised default workflow. PCA exists prior to UMAP because it was needed prior to running t-SNE (and UMAP has by and large replaced t-SNE).
@NadigAjay @GorinGennady And since variance increases with expression, when used on raw data PCA will effectively be driven by the most highly expressed genes, rather than all genes.
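A small simulation (independent Poisson genes, my own toy example, not either package's pipeline) of PC1 chasing the highest-expressed gene:

```python
import numpy as np

rng = np.random.default_rng(1)
# 100 cells x 3 genes; the third gene is ~100x more highly expressed.
X = np.column_stack([rng.poisson(5, 100),
                     rng.poisson(10, 100),
                     rng.poisson(1000, 100)]).astype(float)

Xc = X - X.mean(axis=0)                        # center each gene
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.abs(Vt[0]))   # PC1 loadings: almost all weight on the high-count gene
```

Because Poisson variance equals the mean, the high-count gene contributes almost all the variance, and PC1 points almost entirely along it.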
