In response to questions & comments by @hippopedoid, @adamgayoso, @akshaykagrawal et al. on "The Specious Art of Single-Cell Genomics", Tara Chari & I have posted an update with some new results. Tl;dr: definitely time to stop making t-SNE & UMAP plots.🧵biorxiv.org/content/10.110…
In a previous thread I talked about the (von Neumann) elephant in the dimension reduction room: t-SNE & UMAP don't preserve local or global structure, they distort distances, and they are arbitrary. Almost everybody knows this but they are used anyway...
There were some interesting technical questions about our work. One question was the extent to which PCA pre-conditioning affects results. We examined this (Supp. Fig. 3). Tl;dr: it's time to stop making t-SNE & UMAP plots (with or without PCA pre-conditioning).
Several people asked whether UMAP preserves neighbors. We didn't initially report on this because others have done so previously, and found that UMAP does *not* do a good job preserving them. @hippopedoid et al. showed this in nature.com/articles/s4146…
But since people ignored them, we added neighbor results for our data (Supp. Fig. 7). Tl;dr: t-SNE & UMAP scramble neighbors and do not preserve local structure. It's true that PCA is worse at preserving local structure, but a rotten apple isn't tasty just because another apple has been rotting longer.
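One common way to quantify "neighbors are scrambled" is to compare each point's k nearest neighbors before and after embedding. The sketch below is a minimal illustration under stated assumptions, not necessarily the metric behind Supp. Fig. 7: the function names, the choice of Jaccard overlap, the toy Gaussian data, and the "keep the first two coordinates" stand-in for an embedding are all hypothetical.

```python
import math
import random


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mean_jaccard(high_dim, low_dim, k):
    """Average Jaccard overlap between k-NN sets in the original space and in the embedding."""
    nn_high = knn_indices(high_dim, k)
    nn_low = knn_indices(low_dim, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(nn_high, nn_low)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for expression data: 200 points in 50 dimensions (an assumption).
    data = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(200)]
    # Crude stand-in for an embedding: keep only the first two coordinates.
    embedding = [row[:2] for row in data]
    print(mean_jaccard(data, data, k=10))       # identical spaces give 1.0
    print(mean_jaccard(data, embedding, k=10))  # a lossy projection scores lower
```

To check a real embedding, pass the original matrix and the 2D coordinates as lists of rows; a score near 0 means almost no neighbors survived the embedding.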
Some people argued that UMAP on MNIST shows it is a useful visualization tool. So we looked at MNIST carefully. UMAP is quantitatively terrible and visually misleading for an analysis of MNIST. It seems good in @hippopedoid's tweet but that's a mirage.
First of all, the digits are mixed up in some of the seemingly "pure" clusters. The illusion arises from overplotting: points drawn later cover points drawn earlier. Re-ordering the way in which points are displayed shows that many digits are misplaced. This is borne out in a quantitative analysis of the data.
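The overplotting effect can be sketched without any plotting library by treating the canvas as a grid in which the last point drawn in a cell hides everything beneath it. This is only an illustrative model under assumed names and toy data (a hypothetical cluster of 900 points labeled "3" with 100 points labeled "8" mixed in), not the analysis from the preprint.

```python
import random
from collections import Counter


def top_label_per_cell(points, labels, order, cell=1.0):
    """Simulate drawing points in `order`; the last point drawn in a grid cell hides the rest."""
    top = {}
    for i in order:
        x, y = points[i]
        key = (int(x // cell), int(y // cell))
        top[key] = labels[i]  # later draws overwrite earlier ones
    return top


def mixed_cell_fraction(points, labels, cell=1.0):
    """Fraction of occupied grid cells that actually contain more than one label."""
    seen = {}
    for (x, y), lab in zip(points, labels):
        seen.setdefault((int(x // cell), int(y // cell)), set()).add(lab)
    return sum(len(s) > 1 for s in seen.values()) / len(seen)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 2D "cluster": 100 points labeled "8" mixed in with 900 labeled "3".
    points = [(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(1000)]
    labels = ["8"] * 100 + ["3"] * 900

    in_label_order = list(range(len(points)))  # every "8" drawn first, then "3" on top
    shuffled = in_label_order[:]
    rng.shuffle(shuffled)

    # What the viewer sees per cell under each draw order, versus what is really there.
    print(Counter(top_label_per_cell(points, labels, in_label_order).values()))
    print(Counter(top_label_per_cell(points, labels, shuffled).values()))
    print(mixed_cell_fraction(points, labels))
```

Comparing the visible labels under the two draw orders with the mixed-cell fraction shows how a cluster can look pure while containing other digits.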
These bad visuals don't reflect the intrinsic structure of the MNIST data. In fact, it's possible to perform digit assignment very accurately (99.87%) with the MNIST data. It just requires operating in a dimension much greater than 2. paperswithcode.com/sota/image-cla…
t-SNE and UMAP also scramble the relative cluster placements. Clusters that are far apart appear near and vice versa. The apparent sizes of the clusters in the embedding bear no relation to their actual sizes.
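A simple way to test whether relative cluster placement survives is to compare the ranking of inter-centroid distances before and after embedding. The sketch below assumes labeled clusters, Euclidean distance, and a Spearman rank correlation; all function names are hypothetical and it is not claimed to be the preprint's method. It needs at least three clusters, and ties in distance are broken by order rather than averaged.

```python
import math
from itertools import combinations


def centroids(points, labels):
    """Mean coordinate vector for each label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: [sum(c) / len(ps) for c in zip(*ps)] for lab, ps in groups.items()}


def ranks(values):
    """Rank positions (0 = smallest); ties are broken by order, which is fine for a sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r


def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def cluster_distance_agreement(high_dim, low_dim, labels):
    """Rank agreement between inter-centroid distances before and after embedding."""
    c_high, c_low = centroids(high_dim, labels), centroids(low_dim, labels)
    pairs = list(combinations(sorted(c_high), 2))
    d_high = [math.dist(c_high[a], c_high[b]) for a, b in pairs]
    d_low = [math.dist(c_low[a], c_low[b]) for a, b in pairs]
    return spearman(d_high, d_low)
```

A value of 1 means the embedding keeps the same ordering of which clusters are closest and farthest; values near 0 or below mean that far-apart clusters appear near and vice versa.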
A UMAP of MNIST can even fail to reveal the correct number of digits, as this UMAP made by @akshaykagrawal shows (from his @GoogleColab notebook made to try Picasso). In it, there are two distinct clusters for the digit 8, with one sandwiched between 3 and 4.
It's not that two-dimensional visualizations are always useless. It's just that t-SNE and UMAP, when applied to high-dimensional data, provide visualizations that require users to navigate non-biologically interpretable parameters and tune visuals to their expectations.
It's convenient to say "these methods don't try to preserve distances", but the way embeddings are used inherently assumes distances and ordinal relationships have been preserved. In biology, entire methods depend on such assumptions (e.g., Monocle3, velocyto, scvelo).
As a result of a lot of handwaving, hyperbole and hype, there also seems to be tons of confusion about what UMAP & t-SNE do, including conflation of their loss functions with the properties that the embeddings they produce actually have.
For example, some people pointed out that t-SNE and UMAP weren't meant to preserve local and global structure... only distance.
Others disagreed with that, and noted that t-SNE and UMAP were supposed to preserve *some* global structure (quoting, for example, the authors of t-SNE):
In disagreement with that, some noted that t-SNE and UMAP were not aimed at preserving global structure, but rather were preserving *local* structure. This is a common belief (based on the loss function), but as we and others have shown, it's just not true.
One person pointed out UMAP and related methods "don't try to preserve [distances]" but instead "put similar items near and dissimilar items not near [each other]". 🤔
The confusion about UMAP and t-SNE is not the fault of users. It reflects a lack of theorems about their performance, hype by some, and the fact that the images they make with thousands or even millions of points can be beautiful (even if misleading).
The upshot: there's no need to throw a UMAP monkey wrench into your analysis. We can't help but make inferences when we look at these visuals. But "Specious Art" shows they are misleading, and they are also arbitrary (Picasso image by @sinabooeshaghi github.com/sbooeshaghi/pi…).

Thread by Lior Pachter.

