Lior Pachter
Aug 27, 2021
It's time to stop making t-SNE & UMAP plots. In a new preprint w/ Tara Chari we show that while they display some correlation with the underlying high-dimensional data, they don't preserve local or global structure and are misleading. They're also arbitrary. 🧵 biorxiv.org/content/10.110…
On t-SNE & UMAP preserving structure: 1) we show massive distortion by examining what happens to equidistant cells and cell types, 2) neighbors aren't preserved, and 3) biologically meaningful metrics are distorted.
These distortions are inevitable. Cells or cell types that are equidistant in high dimension must exhibit increasing distortion in two dimensions as their number grows. In fact, the UMAP and t-SNE distortions are even worse (much worse!) than the lower bounds from theory.
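A minimal sketch of the equidistant-points argument (not the code from the preprint): the standard basis vectors of R^n are mutually equidistant, so any 2D embedding must spread their pairwise distances apart, and the spread grows with n. Metric MDS is used below only as an illustrative embedder.

```python
# Equidistant points forced into 2D: in the original space every pairwise
# distance is sqrt(2); in a 2D embedding the max/min distance ratio grows with n.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

for n in (10, 50, 250):
    X = np.eye(n)  # n mutually equidistant points in n dimensions
    emb = MDS(n_components=2, random_state=0).fit_transform(X)
    d = pdist(emb)
    print(f"n={n:4d}  max/min pairwise distance in 2D: {d.max() / d.min():.1f}")
```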
We find evidence of massive distortion in numerous datasets (and we make the tools for assessing this available). In practice, this means you can't make claims about datasets being the same or different based on a t-SNE or UMAP plot alone. We took a close look at this case:
UMAP applied to integrated data can make the data look more or less mixed than it actually is. The effect can go both ways! This is a result with recent data from the @jacob_hanna lab on recapitulating mouse embryogenesis ex utero.
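One way to make the "looks more mixed than it is" claim concrete (a hedged sketch, not the preprint's exact metric): compare, for each cell, the fraction of its nearest neighbors that come from its own batch in the high-dimensional space versus in the 2D embedding. `X_highdim`, `X_2d` and `batch` below are placeholders supplied by the user.

```python
# Illustrative mixing check (not the preprint's exact metric): the fraction of
# each cell's k nearest neighbours that share its batch label, computed in the
# high-dimensional space and again in the 2D embedding.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def same_batch_fraction(X, batch, k=30):
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbours = idx[:, 1:]  # drop each cell itself
    return (batch[neighbours] == batch[:, None]).mean()

# batch = np.asarray(batch_labels)                      # user-supplied labels
# print("high-dim:", same_batch_fraction(X_highdim, batch))
# print("2D embedding:", same_batch_fraction(X_2d, batch))
# A large gap between the two numbers means the 2D plot misrepresents mixing.
```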
None of this is a surprise. The Johnson-Lindenstrauss Lemma provides bounds on the dimensions in which low distortion is possible: for 10,000 points you need >= 1,842 dimensions. There is a constant factor that gives some wiggle room... but it's nowhere near 2!
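The bound is easy to compute with scikit-learn's helper, which implements k >= 4 ln(n) / (eps^2/2 - eps^3/3). The exact 1,842 figure corresponds to the particular eps and constant used in the preprint; the values below are just meant to show the order of magnitude.

```python
# Johnson-Lindenstrauss minimum dimension for 10,000 points at a few distortion
# levels eps (a sketch; the preprint's 1,842 comes from a specific choice of
# eps and constant).
from sklearn.random_projection import johnson_lindenstrauss_min_dim

for eps in (0.1, 0.2, 0.3):
    k = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=eps)
    print(f"eps={eps:.1f}: need at least {k} dimensions")
# roughly 7,894 / 2,125 / 1,023 dimensions -- nowhere near 2.
```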
OK, but maybe t-SNE & UMAP (or your favorite 2D viz) aren't perfect, yet they are "canonical" and not arbitrary? Nope. They're just art. We developed Picasso for embedding your data into any shape, with less distortion than t-SNE & UMAP (see the elephant at the start of the 🧵).
This elephant comes from an anecdote about Enrico Fermi, who once critiqued the complexity of a model of Freeman Dyson's by quoting John von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." We had more than that! fermatslibrary.com/s/drawing-an-e…
Picasso can produce quantitatively similar plots that are qualitatively very different: here is the same dataset rendered as a world map and as von Neumann's elephant.
And here are two entirely different datasets both looking like von Neumann's elephant. These visualizations have similar properties to t-SNE and UMAP in terms of their fidelity to the high-dimensional distances.
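For a concrete sense of what "similar fidelity to the high-dimensional distances" can mean, here is a minimal sketch of one such global metric (not the preprint's full metric suite): the Spearman correlation between pairwise distances in the ambient data and in a 2D embedding. `X_highdim`, `embedding_a` and `embedding_b` are placeholders.

```python
# Global distance-fidelity sketch: Spearman correlation between pairwise
# distances in the high-dimensional data and in a 2D embedding. Two embeddings
# that look nothing alike can score nearly the same here.
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def distance_fidelity(X_highdim, X_2d):
    # Subsample cells first for large datasets: pdist is O(n^2) in memory.
    return spearmanr(pdist(X_highdim), pdist(X_2d)).correlation

# print("embedding A:", distance_fidelity(X_highdim, embedding_a))
# print("embedding B:", distance_fidelity(X_highdim, embedding_b))
```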
Picasso is available via @GoogleColab so you can experiment with turning any dataset you like into any shape you want...while producing a better representation of the data than w/ t-SNE or UMAP. No more Rorschach testing necessary! Make your own elephant!
BTW, you don't even need to train the model(!). Picasso is based on a neural network whose initialization with the Kaiming He method already produces pretty good embeddings (the "no training" case in our comparisons). How is this possible?
We explain this in Supp. Note 4. Kaiming He initialization is an adaptation of random linear projection with fixed Gaussian weights, the kind of projection that can be used to prove the Johnson-Lindenstrauss Lemma. So even a neural net w/o training is competitive w/ t-SNE/UMAP! openreview.net/forum?id=BJbXZ…
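The "no training" observation is easy to try for yourself; a minimal PyTorch sketch, assuming `X` is a cells-by-genes array (this is not Picasso's architecture, just the bare initialization):

```python
# An untrained, Kaiming-initialised linear layer is (up to scaling) a Gaussian
# random projection -- the kind of map used in proofs of Johnson-Lindenstrauss.
import torch

def untrained_kaiming_projection(X, out_dim=2, seed=0):
    torch.manual_seed(seed)
    layer = torch.nn.Linear(X.shape[1], out_dim, bias=False)
    torch.nn.init.kaiming_normal_(layer.weight)  # weights ~ N(0, 2/fan_in)
    with torch.no_grad():
        return layer(torch.as_tensor(X, dtype=torch.float32)).numpy()

# emb = untrained_kaiming_projection(X)  # a surprisingly reasonable baseline
```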
If t-SNE and UMAP et al. are just specious art, what should one do instead? We argue that instead of focusing on 2D visualizations, we should perform semi-supervised dimensionality reduction to higher dimensions, customized to the hypotheses/problems of interest.
We develop MCML (multi-class multi-label) dimensionality reduction for this purpose. We're far from the first to argue for semi-supervised learning in single-cell genomics applications; we're just jumping on the (right) train. See, e.g., @bidumit et al. nature.com/articles/s4146…
MCML is more general than existing approaches, and can be used with both discrete and continuous features. On a deeper level, we believe semi-supervised approaches can help us understand what the dimensions of transcriptomes truly are. An important direction for future work.
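To make the general idea tangible (this is NOT the MCML architecture or loss from the preprint, just a generic semi-supervised sketch): an autoencoder whose latent space is additionally shaped by a classification loss on whichever cells happen to have labels. `X` (cells x genes) and `labels` (integers, -1 for unlabeled) are assumed to be supplied.

```python
# Generic semi-supervised embedding sketch (not MCML): reconstruction loss on
# all cells, plus a classification loss on the labelled cells only.
import torch
import torch.nn as nn

class SemiSupervisedEmbedder(nn.Module):
    def __init__(self, n_genes, n_classes, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_genes))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z), self.classifier(z)

def fit(X, labels, n_classes, n_epochs=200, lr=1e-3):
    X = torch.as_tensor(X, dtype=torch.float32)
    labels = torch.as_tensor(labels, dtype=torch.long)
    labelled = labels >= 0
    model = SemiSupervisedEmbedder(X.shape[1], n_classes)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        z, recon, logits = model(X)
        loss = nn.functional.mse_loss(recon, X)            # all cells
        if labelled.any():                                  # labelled cells only
            loss = loss + nn.functional.cross_entropy(logits[labelled],
                                                      labels[labelled])
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.encoder(X).numpy()  # a higher-dimensional latent, not a 2D plot
```

The point of the sketch: the latent space stays higher-dimensional and is shaped by the hypotheses encoded in the labels, rather than by a 2D layout heuristic.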
This project was motivated by several recent papers, tweets we saw online, discussions, and prior projects in our lab. We started thinking about how "canonical" t-SNE & UMAP are after seeing a "map of Europe"-styled picture of brain #scRNAseq data.
2D planar maps of geometry on the sphere have some distortion, but they are also canonical w/ respect to a specific projection, i.e. there is a ground truth they represent in a well-defined way.
But after exploring NCA for (supervised) dimensionality reduction to visualize cells with respect to a clustering in @sinabooeshaghi et al., we realized that 2D plots could be made to look much "cleaner" if one wanted, and were in a sense arbitrary. biorxiv.org/content/10.110…
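For reference, supervised dimensionality reduction of the kind mentioned here is available in scikit-learn as NeighborhoodComponentsAnalysis (a sketch, not the analysis from that preprint; `X` and `cluster_labels` are placeholders):

```python
# NCA learns a linear map driven by the cluster labels, so the 2D view can be
# made to look as "clean" as the labels allow -- which is exactly why such
# plots can be arbitrary.
from sklearn.neighbors import NeighborhoodComponentsAnalysis

nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
# X_2d = nca.fit_transform(X, cluster_labels)
```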
It seemed that the "map of Europe" rendering of the brain was, in fact, arbitrary, and could easily have been presented differently. Which is to say, claims about its "informative" value didn't ring true.
This ultimately led us in two directions: exploring the fidelity of t-SNE & UMAP visualizations, and separately developing Picasso to make a point, namely that single-cell genomics art is beautiful to look at (a good enough reason to make it), but it's art.
And this important paper by @jcjray and colleagues should be required reading:
Perhaps it's time for everyone to say out loud what we've all known for some time, but have had difficulty admitting: t-SNE, UMAP and relatives are just specious art and we risk fooling ourselves when we start to believe in the mirages they present.
@KeithComplexity It's still very far from being quantitative in the way people would like to think it is, and it is very much arbitrary if Picasso can outperform it on biologically motivated metrics.
@akshaykagrawal @KeithComplexity There is a theorem to go along with t-SNE, and I had hopes for it (in terms of practice), but while it's an interesting theorem revealing connections to spectral clustering methods, the situation in practice is, well... see our preprint.
@akshaykagrawal @KeithComplexity I'm not aware of one. And in lieu of such a theorem, the kind of empirical analysis in our preprint, where we look carefully at biologically meaningful metrics, as well as difficult cases (e.g. equidistant points), makes sense as a way to get a handle on performance of a method.
@KenjiEricLee @TAH_Sci @sam_power_825 This is not to be flippant. We measured the distortion among neighbors, i.e. local structure, in the UMAP embedding, and it was very large. That's not good. Maybe the biological datasets don't satisfy the UMAP requirements. Maybe the UMAP heuristics are failing. I don't really know.
@KenjiEricLee @TAH_Sci @sam_power_825 But at the end of the day what we asked is not how much mathematics the UMAP developers mastered, or what their intention was. The question we looked at is whether UMAP is outputting what biologists think it is, and whether its output is suitable for the way they are using it.
@KenjiEricLee @TAH_Sci @sam_power_825 Unfortunately it isn't. And yet it has become ubiquitous. 🤷‍♂️
@KenjiEricLee @TAH_Sci @sam_power_825 Meanwhile, not only is UMAP used quantitatively (e.g. see the mixing example in this thread), it has become the basis for applying other algorithms on top of it.
@TAH_Sci @KenjiEricLee @sam_power_825 Of course UMAP may do perfectly fine for datasets with low intrinsic dimension, or that are just much simpler than what is encountered in biology, MNIST being a good example where I think the visualization is fine. But are we really trying to learn new things about MNIST?
@TAH_Sci @KenjiEricLee @sam_power_825 I think classifiers have achieved a 0.17% error rate on the dataset.
@KenjiEricLee @TAH_Sci @sam_power_825 @theosysbio This (from that paper) shows that even 10 nearest neighbors cannot be preserved.
@KenjiEricLee @TAH_Sci @sam_power_825 @theosysbio From the paper: "KNN is the fraction of 𝑘-nearest neighbours in the original high-dimensional data that are preserved as 𝑘-nearest neighbours in the embedding. We used 𝑘=10.. KNN quantifies preservation of the local, or microscopic structure."
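That metric is straightforward to reimplement; a minimal sketch with k=10 (not the paper's code; `X_highdim` and `X_2d` are placeholders):

```python
# Fraction of each cell's k nearest neighbours in the high-dimensional data
# that remain among its k nearest neighbours in the embedding (1.0 = perfect).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_preservation(X_highdim, X_2d, k=10):
    idx_hi = NearestNeighbors(n_neighbors=k + 1).fit(X_highdim).kneighbors(X_highdim)[1][:, 1:]
    idx_lo = NearestNeighbors(n_neighbors=k + 1).fit(X_2d).kneighbors(X_2d)[1][:, 1:]
    overlap = [len(set(a) & set(b)) for a, b in zip(idx_hi, idx_lo)]
    return np.mean(overlap) / k

# print("kNN preservation:", knn_preservation(X_highdim, X_2d))
```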
@adamgayoso In fact one person critiqued us for doing PCA in the first place. It seems a good benchmark would be to vary the PCA reduction from no reduction up to the ambient dimension and assess distortion between the embedding and the PCA-reduced space, and between the PCA-reduced space and the ambient space, in addition to what we're doing.
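A hedged sketch of what that benchmark could look like (not an analysis from the preprint; `X_ambient`, `embed_2d` and `fidelity` are placeholders supplied by the user, e.g. one of the metrics sketched earlier in this thread):

```python
# Sweep the number of PCA components and, at each setting, measure distortion
# between the PCA-reduced space and the ambient space, and between the 2D
# embedding and the PCA-reduced space.
from sklearn.decomposition import PCA

def pca_sweep(X_ambient, embed_2d, fidelity, dims=(10, 50, 100, 500)):
    results = {}
    for d in dims:
        X_pca = PCA(n_components=d).fit_transform(X_ambient)
        results[d] = {
            "pca_vs_ambient": fidelity(X_ambient, X_pca),
            "embedding_vs_pca": fidelity(X_pca, embed_2d(X_pca)),
        }
    return results
```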
@adamgayoso Thanks for the feedback, btw. We'll look at some of this as we add to our initial results from all the comments we got.
The preprint is now published (with substantial reorganization, improvements, and extensions in response to feedback from here and from reviewers) in @PLOSCompBiol: journals.plos.org/ploscompbiol/a…
