Lior Pachter
Jan 22, 2022, 26 tweets
Is a single-cell RNA-seq atlas really an atlas? A short thread about #scRNAseq, maps, and atlantes (yes, the plural of atlas is atlantes! h/t @NeuroLuebbert). 🧵1/
Atlantes must be accurate to be useful, and the question that has vexed mapmakers for centuries, namely how best to represent the spherical Earth in 2D, is nontrivial. There have been many proposals, with pros & cons for each (because the sphere and the plane have different Gaussian curvatures). 2/
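The curvature remark can be made precise. The following is a standard statement of Gauss's Theorema Egregium, written out here as a sketch; the symbols R (sphere radius), f (a candidate map) and p (a point on the sphere) are notation introduced for this note and do not appear in the thread.

```latex
\[
K_{\text{sphere}} = \frac{1}{R^{2}} > 0, \qquad K_{\text{plane}} = 0 .
\]
% Theorema Egregium: Gaussian curvature is preserved by any local isometry f.
\[
f \text{ a local isometry}
\;\Longrightarrow\;
K_{\text{plane}}\bigl(f(p)\bigr) = K_{\text{sphere}}(p)
\;\Longrightarrow\;
0 = \frac{1}{R^{2}} ,
\]
% a contradiction, so no map from a patch of the sphere to the plane preserves all distances.
```

Every flat map therefore has to give something up (distances, areas, or angles), which is why each projection comes with its own trade-offs.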
In #scRNAseq, atlases of cells have become synonymous with UMAP figures of gene expression matrices (it used to be t-SNE, but UMAP seems more popular now). Map making from gene expression matrices is more challenging than map making of our 3D world; #scRNAseq is in ~10⁴ dimensions. 3/
Mathematician George Pólya gave the following advice: "If you can't solve a problem, then there is an easier problem you can solve: find it." This has been ignored in #scRNAseq, which wouldn't matter, except the method used for the general case fails on the simplest one. 4/
Below is an example from a simple case. It's a UMAP of a group of cells that do not live in some huge dimension; here there are only 3 genes. The data was clustered with the popular "Leiden" method. The figure *seems* ok, with the visual more or less confirming the clustering. 5/
But what was the actual example, the "ground truth" that this "atlas" represents? These were points selected uniformly at random on the sphere. No actual structure whatsoever. You can see how the UMAPs look for varying parameters: 6/
The Leiden clustering was performed on the uniformly sampled points. Of course the clusters consist of points that were close together, but their boundaries and shapes are meaningless... the points were sampled (densely) uniformly at random... 7/
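For readers who want to reproduce the "ground truth", here is a minimal sketch of sampling points uniformly at random on the unit sphere. It uses the standard trick of normalizing a vector of independent Gaussians; the function name `sample_sphere`, the seed, and the sample size are assumptions made for illustration, not taken from the code linked at the end of the thread.

```python
import math
import random


def sample_sphere(n, seed=0):
    """Return n points drawn uniformly at random from the unit sphere in 3D."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        # Three independent standard normals give a rotationally symmetric
        # vector, so its direction is uniformly distributed on the sphere.
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        if norm < 1e-12:
            continue  # vanishingly unlikely; skip to avoid dividing by ~0
        points.append((x / norm, y / norm, z / norm))
    return points


# Hypothetical sample size; each point plays the role of a "cell" with 3 "genes".
points = sample_sphere(5000)
```

Any cluster boundaries found in `points` are artifacts of the clustering method, since by construction the data have no structure beyond lying on the sphere.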
How do people currently select parameters for the UMAPs they make? They tune them until they get a picture that matches the clustering well... #confirmationbias 8/
You might wonder whether *any* of the choices of parameters produce a good map. All the atlantes are poor in this case. To see this, look at what happens to an actual map of the world (points colored by continent). Sometimes continents are broken apart, e.g. Africa in this case. 9/
Sometimes sea water is mixed in with land (look at South America). 10/
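A less biased alternative to tuning by eye is to fix a grid of parameters up front and look at every result. The sketch below assumes the umap-learn package and its `umap.UMAP(n_neighbors=..., min_dist=...)` estimator; the grid values and the helper names `parameter_grid` and `embed_all` are hypothetical choices for illustration, not the settings used for the figures in this thread.

```python
from itertools import product

# Hypothetical grid of the two UMAP parameters most often tuned.
N_NEIGHBORS = (5, 15, 50, 200)
MIN_DIST = (0.0, 0.1, 0.5, 0.8)


def parameter_grid(n_neighbors=N_NEIGHBORS, min_dist=MIN_DIST):
    """Return every (n_neighbors, min_dist) combination as a dict of keyword arguments."""
    return [{"n_neighbors": k, "min_dist": d} for k, d in product(n_neighbors, min_dist)]


def embed_all(X, grid=None, seed=0):
    """Run UMAP once per parameter setting and return (params, embedding) pairs.

    Requires umap-learn; X is an array-like of shape (n_samples, n_features)
    with more samples than the largest n_neighbors in the grid.
    """
    import umap  # imported lazily so the grid can be inspected without umap-learn

    grid = parameter_grid() if grid is None else grid
    embeddings = []
    for params in grid:
        reducer = umap.UMAP(n_components=2, random_state=seed, **params)
        embeddings.append((params, reducer.fit_transform(X)))
    return embeddings
```

Plotting all 16 embeddings side by side, rather than keeping only the one that agrees with the clustering, makes it harder to mistake a parameter choice for a biological finding.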
No matter what parameters you choose, you'll see some semblance of the continents, but pretty much things are a mess. 11/
The chaos these projections can create is made clearer by omitting the ocean. Look at South America, which in reality is a "cell type" (continent) filled uniformly with cells, but here looks like a differentiation trajectory. 12/
At least in the above, South America is connected to North America. That is not always the case. 13/
Again, you'll find that varying parameters produces maps that, while in some cases better than others, all have major problems. 14/
UMAP author @leland_mcinnes describes it as "capturing the manifold underlying the data" by "stealing the singular set & geometric realization functors from algebraic topology & then adapting them to apply to metric spaces and fuzzy simplicial sets." 15/
umap-learn.readthedocs.io/en/latest/how_…
Well, the sphere is a manifold? What exactly has UMAP captured?

Look, I love algebraic topology but throwing fancy math words around doesn't make a method have good properties. One needs theorems for that. 16/
UMAP is not just randomly placing high-dimensional points in the plane. In benchmarks we've done we see it preserves some structure. But overall it's a poor heuristic. Ask yourself: next time you fly, would you want your pilot navigating with a UMAP? 17/
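One simple way to put a number on "preserves some structure" is to ask how many of each point's nearest neighbours survive the embedding. The sketch below is an illustrative metric chosen for this note, not the benchmark referred to in the thread; the function names and the default k are assumptions, and the brute-force search is only practical for a few thousand points.

```python
import math


def knn_indices(points, k):
    """For each point, the set of indices of its k nearest neighbours (brute force)."""
    neighbours = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(order[:k]))
    return neighbours


def knn_overlap(original, embedded, k=10):
    """Mean fraction of each point's k nearest neighbours shared by both spaces.

    1.0 means every neighbourhood is preserved; values near k / (n - 1) are
    what a random placement of n points would give.
    """
    if len(original) != len(embedded):
        raise ValueError("original and embedded must contain the same number of points")
    if not 0 < k < len(original):
        raise ValueError("k must satisfy 0 < k < number of points")
    before = knn_indices(original, k)
    after = knn_indices(embedded, k)
    shared = sum(len(a & b) for a, b in zip(before, after))
    return shared / (k * len(original))
```

A score well above the random baseline but well below 1.0 is consistent with the description above: some structure is kept, but far from all of it.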
Biologists have pushed back on criticisms of UMAP by saying that (to paraphrase), "of course they are not used for analysis, they are just hypothesis generating plots and all predictions must be validated". First of all, UMAP is used for analysis: 18/
Second, considering how expen$ive most experiments are in biology, and how much time they take, are graduate students really going to spend years in a lab chasing a UMAP-generated hypothesis to confirm that it is real? 19/
This thread has focused on UMAP, but it also highlights problems with clustering. Here is a Leiden clustering of the continents (from points uniformly sampled within them, displayed with Mercator projection). Not terrible, but is Africa really two continents? 20/
The interaction between UMAP and the clustering makes a reasonably good clustering much worse. That's because it magnifies small differences. In many parameter choices below, blue and yellow look like two separate clusters. There's a "novel" cell type right there! 21/
In addition to all of these problems with single-cell atlantes, there is also the problem that they are not "canonical", the way one would like an atlas to be. 22/
What should one produce instead of UMAP atlantes? There are many useful ways to visualize information, even geographic information, that can yield great insight. Turning statistics into art can be challenging, but it's important and useful. No need to be lazy. 23/
This thread was motivated by discussions with @IngileifBryndis, and inspired in part by the beautiful animations of @JEFworks. 24/
The UMAP analyses of this thread, and their visualizations and animations, were produced by @LambdaMoses. Her code used to make the figures is available here: github.com/lambdamoses/um… 25/25
Correction to 16/: the "?" after "the sphere is a manifold" should be a "." (annoying typo; the point is that yes, the sphere is a manifold).
