Lior Pachter
May 2 · 17 tweets · 6 min read
In a recent preprint with @GorinGennady (biorxiv.org/content/10.110…) we provide a quantitative answer to this question: what information about variance (among cells in a cell type, or more generally across many cell types) does a UMAP provide? A short 🧵 1/
The variability in gene expression across cells can be attributed to biological stochasticity and technical noise. In practice it's hard to break down the variance into these constituent parts. How do we know what is biological vs. technical? 2/
Here's an idea: within a cell type, we can obtain an accurate estimate of gene expression by averaging across cells. Now we can get a lower bound for biological variability by computing the variance across very distinct cell types. 3/
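A minimal sketch of that idea, not the code from the preprint. It assumes a dense cells x genes count matrix and one cell-type label per cell. The function name `between_type_fraction` and the cell-weighted law-of-total-variance split are illustrative assumptions; the preprint's exact statistic may differ. It also treats the per-type means as accurate, which needs plenty of cells in each type.

```python
import numpy as np

def between_type_fraction(counts, labels):
    """Per-gene share of total variance explained by differences
    between cell-type means (cell-weighted law of total variance)."""
    counts = np.asarray(counts, dtype=float)   # cells x genes
    labels = np.asarray(labels)                # one label per cell
    grand_mean = counts.mean(axis=0)
    total_var = counts.var(axis=0)             # population variance per gene

    between = np.zeros(counts.shape[1])
    for t in np.unique(labels):
        in_type = labels == t
        weight = in_type.mean()                # fraction of cells in this type
        type_mean = counts[in_type].mean(axis=0)
        between += weight * (type_mean - grand_mean) ** 2

    # Genes with zero total variance get NaN instead of a division warning.
    frac = np.full_like(total_var, np.nan)
    np.divide(between, total_var, out=frac, where=total_var > 0)
    return frac
```

With these weights the between-type part can never exceed the total, so the returned fraction sits in [0, 1] and can be read as a floor on the biological share of each gene's variance.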
We can now assess whether a transformation of a count matrix is removing only technical noise, or throwing out some biological signal by accident. The bound means one should end up in the green region. Throwing out too much variance will place a gene outside of it. 4/ Image
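One way such a check could look in code, as a rough sketch only. It assumes variability is compared through the squared coefficient of variation (CV²) before and after the transformation, and that a per-gene lower bound on the biological fraction is already in hand. The names `cv2` and `outside_admissible_region` are hypothetical, and the statistic plotted in the preprint's figure may be defined differently.

```python
import numpy as np

def cv2(x):
    """Squared coefficient of variation per gene (column)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    out = np.full(x.shape[1], np.nan)
    np.divide(x.var(axis=0), mean ** 2, out=out, where=mean != 0)
    return out

def outside_admissible_region(raw, transformed, bio_lower_bound):
    """True for genes whose retained share of CV² falls below the
    assumed lower bound on the biological fraction."""
    with np.errstate(divide="ignore", invalid="ignore"):
        retained = cv2(transformed) / cv2(raw)
    # Comparisons with NaN are False, so undefined genes are never flagged.
    return retained < np.asarray(bio_lower_bound, dtype=float)
```

A flagged gene is one where the transformation removed more variability than technical noise alone could account for under the bound.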
Consider the first two steps in every single-cell RNA-seq analysis:
1. Depth normalization (in the fig. below "PF" for proportional fitting).
2. Log-transform, i.e. log(x+1) of the data (in the fig. below "log").
Red points are the top highly expressed genes. 5/ Image
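The two steps can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the pipeline used for the figure: the matrix is cells x genes, every cell has at least one count, and the PF target defaults to the mean depth across cells. The function names are hypothetical.

```python
import numpy as np

def proportional_fit(counts, target=None):
    """Scale each cell (row) so that its total count equals `target`."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)   # total counts per cell
    if target is None:
        target = depth.mean()                   # assumed default: mean depth
    return counts * (target / depth)            # assumes no cell has zero depth

def pf_log(counts):
    """PF followed by log(x + 1)."""
    return np.log1p(proportional_fit(counts))
```

After `proportional_fit` every cell has the same total, so differences in sequencing depth no longer show up as differences between cells.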
What you can see is that depth normalization is OK. It adds a bit of noise (some genes have more variance than they started with), but not too much. However, the log transform drops many genes out of the admissible zone. That is, the transform has removed biological signal. 6/
The intuition behind variance stabilization, or normalization, namely that it's removing technical noise due to sampling etc., is not quite right. Yes, the transform is reducing technical noise, but it's throwing out some of the baby with the bathwater. 7/
The next standard step is PCA. We see that even more biological signal is removed. The final step is UMAP, which adds tons of noise. Some of the points move back above the green line, but remember that biological signal has already been removed. It's just addition of noise. 8/ Image
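For the PCA step only, here is a hedged sketch of how much per-gene variance survives a rank-k projection. It uses a plain SVD on the centered matrix rather than any particular single-cell toolkit, and the function name `variance_kept_by_pca` is hypothetical. UMAP is left out because it needs an external library and adds variance rather than removing it.

```python
import numpy as np

def variance_kept_by_pca(x, n_components):
    """Per-gene fraction of variance that remains after projecting the
    centered cells x genes matrix onto its top principal components."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = n_components
    recon = (u[:, :k] * s[:k]) @ vt[:k]         # rank-k reconstruction
    gene_var = centered.var(axis=0)
    kept = np.full(x.shape[1], np.nan)          # NaN for constant genes
    np.divide(recon.var(axis=0), gene_var, out=kept, where=gene_var > 0)
    return kept
```

Truncating to k components can only lower each gene's variance, so the kept fraction never exceeds 1. Whether what gets dropped is technical or biological is exactly what the bound is meant to test.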
In other words, UMAP is the worst of all worlds. The procedure does not remove technical noise, it adds it to the data. And for no reason: all one gets for noising the data is distortion of distances. 9/
So what should one do instead? In our preprint we present Monod, software for *modeling* both the biophysics and the technical noise. This way, one doesn't have to throw out the baby with the bathwater, because one works to figure out the differences between them. 10/ Image
Another advantage of Monod: it provides a natural way to "integrate" or "harmonize" different data modalities. It provides an answer as to which matrix to choose in an analysis: spliced or unspliced? Rather than choosing between them, with Monod they are used together. 11/ Image
Furthermore, Monod generalizes differential expression testing to the identification of genes with distributional differences. We provide some interesting use cases, and also validate performance with respect to a recently published dataset from @Weinberger_Lab. 12/ Image
Monod is available here: github.com/pachterlab/mon… 13/
Tutorials and examples are provided here: github.com/pachterlab/mon… 14/
Documentation is here: monod-examples.readthedocs.io/en/latest/ 15/
More on @GorinGennady related work in this thread:

Also a once-in-a-career opportunity to hire him: he is defending on May 19th. DM me for Zoom info if you're interested. 16/
In summary, don't start your #scRNAseq analysis by throwing out almost all of the biological signal in your data. And to return to your question @WallaceUcsf, the UMAP for a single cell type is not much more than a Rorschach test. 17/17

