Lior Pachter
Jan 22 · 26 tweets · 11 min read
Is a single-cell RNA-seq atlas really an atlas? A short thread about #scRNAseq, maps, and atlantes (yes, the plural of atlas is atlantes! h/t @NeuroLuebbert). 🧵1/
Atlantes must be accurate to be useful, and the question that has vexed mapmakers for centuries, namely how best to represent the spherical earth in 2D, is nontrivial. There have been many proposals, with pros & cons for each (because the sphere and the plane have different Gaussian curvatures). 2/
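To spell out the curvature remark as a short worked note (the radius symbol R is my notation, not the thread's): a sphere of radius R has constant positive Gaussian curvature, while the plane has curvature zero.

```latex
K_{\text{sphere}} = \frac{1}{R^{2}} > 0,
\qquad
K_{\text{plane}} = 0 .
```

By Gauss's Theorema Egregium, Gaussian curvature is preserved by any map that preserves distances locally. Since the two values differ, no flat map of any region of the sphere can preserve all distances, so every projection has to trade off some mix of distances, areas, and angles.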
In #scRNAseq, atlases of cells have become synonymous with UMAP figures of gene expression matrices (it used to be t-SNE, but UMAP seems more popular now). Map making from gene expression matrices is more challenging than map making of our 3D world; #scRNAseq is in ~10⁴ dimensions. 3/
Mathematician George Pólya gave the following advice: "If you can't solve a problem, then there is an easier problem you can't solve: find it." This has been ignored in #scRNAseq, which wouldn't matter, except the method used for the general case fails on the simplest one. 4/
Below is an example from a simple case. It's a UMAP of a group of cells that are not in some huge-dimensional space; here there are only 3 genes. The data was clustered with the popular "Leiden" method. The figure *seems* ok, with the visual more or less confirming the clustering. 5/
But what was the actual example, the "ground truth" that this "atlas" represents? These were points selected uniformly at random on the sphere. No actual structure whatsoever. You can see how the UMAPs look for varying parameters: 6/
The Leiden clustering was performed on the uniformly sampled points. Of course the clusters consist of points that were close together, but their boundaries and shapes are meaningless... the points were sampled (densely) uniformly at random... 7/
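For readers who want to reproduce this kind of structureless input, here is a minimal sketch of one standard way to draw points uniformly at random on the unit sphere: normalize draws from an isotropic 3D Gaussian. The function name `sample_sphere`, the seed, and the sample size are illustrative assumptions; the thread does not say how the points were generated, and the code linked at the end of the thread may do it differently.

```python
import math
import random


def sample_sphere(n, seed=0):
    """Return n points drawn uniformly at random from the unit sphere in 3D.

    Hypothetical helper for illustration; each point plays the role of a
    "cell" measured on 3 "genes".
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        # An isotropic Gaussian has no preferred direction, so its
        # normalized draws are uniformly distributed on the sphere.
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:  # skip the (vanishingly unlikely) zero vector
            points.append((x / norm, y / norm, z / norm))
    return points


cells = sample_sphere(1000)
```

Any cluster boundaries a method draws on `cells` are artifacts of the method, since by construction no region of the sphere is favored over another.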
How do people currently select parameters for the UMAPs they make? They tune them until they get a picture that matches the clustering well... #confirmationbias 8/
You might wonder whether *any* of the choices of parameters produce a good map. All the atlantes are poor in this case. To see this, look at what happens to an actual map of the world (points colored by continent). Sometimes continents are broken apart, e.g. Africa in this case. 9/
Sometimes sea water is mixed in with land (look at South America). 10/
No matter what parameters you choose, you'll see some semblance of the continents, but things are pretty much a mess. 11/
The chaos these projections can create is made clearer by omitting the ocean. Look at South America, which in reality is a "cell type" (continent) that is filled uniformly with cells, looking like a differentiation trajectory. 12/
At least in the above, South America is connected to North America. That is not always the case. 13/
Again, you'll find that varying parameters produces maps that, while in some cases better than others, all have major problems. 14/
UMAP author @leland_mcinnes describes it as "capturing the manifold underlying the data" by "stealing the singular set & geometric realization functors from algebraic topology & then adapting them to apply to metric spaces and fuzzy simplicial sets." 15/
umap-learn.readthedocs.io/en/latest/how_…
Well, the sphere is a manifold? What exactly has UMAP captured?

Look, I love algebraic topology but throwing fancy math words around doesn't make a method have good properties. One needs theorems for that. 16/
UMAP is not just randomly placing high-dimensional points in the plane. In benchmarks we've done we see it preserves some structure. But it's overall a poor heuristic. Ask yourself: next time you fly would you want your pilot navigating with a UMAP? 17/
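One way to put a number on "preserves some structure" is to compare each point's nearest neighbors before and after embedding. The sketch below is an illustrative assumption, not the benchmark referred to above: the function names `knn` and `neighbor_overlap`, the choice of Euclidean distance, and the default `k=10` are all mine.

```python
import math


def knn(points, k):
    """Index sets of the k nearest neighbors of each point (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by its distance to point i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def neighbor_overlap(original, embedded, k=10):
    """Mean Jaccard overlap of kNN sets before and after embedding.

    1.0 means every neighborhood survived intact; values near 0 mean
    the embedding scrambled who is close to whom.
    """
    before, after = knn(original, k), knn(embedded, k)
    scores = [len(s & t) / len(s | t) for s, t in zip(before, after)]
    return sum(scores) / len(scores)
```

Here `original` would hold the high-dimensional points and `embedded` their 2D coordinates, in the same order. The brute-force neighbor search is quadratic in the number of points, so this is only meant for small examples.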
Biologists have pushed back on criticisms of UMAP by saying that (to paraphrase), "of course they are not used for analysis, they are just hypothesis generating plots and all predictions must be validated". First of all, UMAP is used for analysis: 18/
Second, considering how expen$ive most experiments are in biology, and how much time they take, are graduate students really going to spend years in a lab chasing a UMAP-generated hypothesis to confirm that it is real? 19/
This thread has focused on UMAP, but it also highlights problems with clustering. Here is a Leiden clustering of the continents (from points uniformly sampled within them, displayed with Mercator projection). Not terrible, but is Africa really two continents? 20/
The interaction between UMAP and the clustering makes a reasonably good clustering much worse. That's because it magnifies small differences. For many of the parameter choices below, blue and yellow look like two separate clusters. There's a "novel" cell type right there! 21/
In addition to all of these problems with single-cell atlantes, there is also the problem that they are not "canonical", the way one would like an atlas to be. 22/
What should one produce instead of UMAP atlantes? There are many useful ways to visualize information, even geographic information, that can yield great insight. Turning statistics into art can be challenging, but it's important and useful. No need to be lazy. 23/
This thread was motivated by discussions with @IngileifBryndis, and inspired in part by the beautiful animations of @JEFworks. 24/
The UMAP analyses of this thread, and their visualizations and animations, were produced by @LambdaMoses. Her code used to make the figures is available here: github.com/lambdamoses/um… 25/25
Correction to 16/: ? -> .
(annoying typo; the point is that yes, the sphere is a manifold).

More from @lpachter

Jan 22
Grigory Perelman was making less than $100/month while working on the solution of the Poincaré conjecture at the Steklov Institute. He won't live forever, but his ideas will.
newyorker.com/magazine/2006/…
His critiques of mathematicians hold true for scientists more generally: "...there are many mathematicians who are more or less honest. But almost all of them are conformists. They are more or less honest, but they tolerate those who are not honest."
The @NewYorker piece is filled with good quotes and anecdotes that many could learn from.

"Speed means nothing. Math doesn’t depend on speed. It is about deep." - Yuri Burago. This is so so true, and not just for math.
Oct 7, 2021
The 17 #BICCN @nature papers on the primary motor cortex in mouse (+some human & marmoset) that were published yesterday are a major step forward in terms of open science for an @NIH consortium. For reference, links to the open access papers are here: nature.com/collections/ci… 1/🧵
First, the #BICCN required preprints of all the papers to be posted on @biorxivpreprint, and as a result the papers were already online 1-1.5 years ago. Of course the final versions now published have been revised in response to peer review. 2/
Speaking of peer review, almost all the papers were published along with the reviews. In combination with the preprints, this provides an unprecedented view of how consortium work is reviewed and how authors respond. Real data for this perennial debate: 3/
Sep 29, 2021
In 2008, as a new professor of molecular and cell biology @UCBerkeley I presented at a seminar series intended to introduce 1st year students to research in the department. Two profs. presented each time, with food beforehand. I was paired with Thai food and Peter Duesberg. 2/
I knew of Peter Duesberg and his HIV/AIDS denialism, but I hadn't realized that he worked @UCBerkeley. We were now colleagues in the same department. 😱 3/
Sep 24, 2021
I like the reproducibility standards for machine learning in the life sciences by @autobencoder, @michaelhoffman, @markowetzlab, @suinleelab, @GreeneScientist & @stephaniehicks but I propose an additional platinum standard for one click reproducibility. 1/
By "one click", I mean that the entire analysis be reproducible in a (free) interactive online session of @colab (or other similar service). All steps of the analysis, from downloading data to generating figures are then not only automated but accessible for users. 2/
For an example of what this entails and facilitates, see: pachterlab.github.io/CWGFLHGCCHAP_2… 3/
Sep 22, 2021
In response to questions & comments by @hippopedoid, @adamgayoso, @akshaykagrawal et al. on "The Specious Art of Single-Cell Genomics", Tara Chari & I have posted an update with some new results. Tl;dr: definitely time to stop making t-SNE & UMAP plots.🧵biorxiv.org/content/10.110…
In a previous thread I talked about the (von Neumann) elephant in the dimension reduction room: t-SNE & UMAP don't preserve local or global structure, they distort distances, and they are arbitrary. Almost everybody knows this but they are used anyway...
There were some interesting technical questions about our work. One question was the extent to which PCA pre-conditioning affects results. We examined this (Supp. Fig. 3). Tl;dr: it's time to stop making t-SNE & UMAP plots (with or without PCA pre-conditioning).
Aug 27, 2021
It's time to stop making t-SNE & UMAP plots. In a new preprint w/ Tara Chari we show that while they display some correlation with the underlying high-dimension data, they don't preserve local or global structure & are misleading. They're also arbitrary.🧵biorxiv.org/content/10.110…
On t-SNE & UMAP preserving structure: 1) we show massive distortion by examining what happens to equidistant cells and cell types. 2) neighbors aren't preserved. 3) Biologically meaningful metrics are distorted. E.g., see below:
These distortions are inevitable. Cells or cell types that are equidistant in high dimension must exhibit increasing distortion as they increase in number. Actually, UMAP and t-SNE distortions are even worse (much worse!) than the lower bounds from theory.