A year ago in Nature Biotechnology, Becht et al. argued that UMAP preserved global structure better than t-SNE. Now @GCLinderman and I wrote a comment saying that their results were entirely due to the different initialization choices: biorxiv.org/content/10.110…. Thread. (1/n)
Here is the original paper: nature.com/articles/nbt.4… by @EtienneBecht @leland_mcinnes @EvNewell1 et al. They used three data sets and two quantitative evaluation metrics: (1) preservation of pairwise distances and (2) reproducibility across repeated runs. UMAP won 6/6. (2/10)
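For concreteness, here is a minimal sketch of how a metric like (1) could be computed: the Pearson correlation between pairwise distances in the original data and in the embedding. This is an assumed, simplified form and not necessarily the exact procedure used by Becht et al.; the function names and the random stand-in data are hypothetical.

```python
import numpy as np

def pairwise_distances(A):
    """Euclidean distances between all unordered pairs of rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    upper = np.triu_indices(A.shape[0], k=1)  # each pair counted once
    return np.sqrt(np.maximum(d2[upper], 0.0))

def distance_preservation(X, Y):
    """Pearson correlation between pairwise distances in X and in Y.

    Assumed form of a 'preservation of pairwise distances' score;
    values near 1 mean the embedding Y keeps the distance structure of X.
    """
    dx = pairwise_distances(X)
    dy = pairwise_distances(Y)
    return float(np.corrcoef(dx, dy)[0, 1])

# Stand-in data (hypothetical): 200 points in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y_random = rng.normal(size=(200, 2))  # an "embedding" unrelated to X

score_scaled = distance_preservation(X, 3.0 * X)   # rescaled copy of X
score_random = distance_preservation(X, Y_random)  # unrelated layout
```

The score is invariant to rescaling the embedding, which matters because t-SNE and UMAP layouts have arbitrary overall scale.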
UMAP and t-SNE optimize different loss functions, but the implementations used in Becht et al. also used different default initialization choices: t-SNE was initialized randomly, whereas UMAP was initialized using the Laplacian eigenmaps (LE) embedding of the kNN graph. (3/10)
Were the results due to the different loss functions or due to the different initializations? George extended the code of Becht et al. to add UMAP with random initialization and t-SNE (using FIt-SNE) with PCA initialization to the benchmark comparison. This is the result. (4/10)
Turns out, it was *entirely* due to initialization! UMAP with random initialization preserved global structure as poorly as t-SNE with random initialization, whereas t-SNE with informative (PCA) initialization performed as well as UMAP with informative (LE) initialization. (5/10)
This is particularly obvious for the reproducibility metric: of course if one runs t-SNE with random initialization and different random seeds, one can get very different global arrangements of clusters. People tend to think it is not true for UMAP, but we show that it is. (6/10)
In our view, the results of Becht et al. do not actually support the claim that UMAP preserves global structure better than t-SNE, which is how it's been cited in the field. The real lesson is that one should not be using random initialization for either of these methods. (7/10)
This is in agreement with the recommendation to use PCA initialization (rather than random initialization) for t-SNE made in the recent paper by @CellTypist and me:
Just to be clear: this is *not* an attack on UMAP! I think UMAP is great :-) But I also think t-SNE is great. And there is plenty of room for further improvements and for better conceptual understanding of this whole family of embedding methods. (9/10)
But to decide which algorithm is more faithful to the single-cell data, further research is needed. Our Comment argues that the Becht et al. paper does not answer that. (10/10)
The input here is a 1,000,000 x 78,628 matrix X with X_ij = 1 if integer i is divisible by the j'th prime number, and 0 otherwise. So columns correspond to 2, 3, 5, 7, 11, etc. The matrix is large but very sparse: only 0.0036% of entries are 1s. We'll use cosine similarity. [3/n]
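A scaled-down sketch of this construction, assuming the rows are the integers 1..N and using N = 1000 instead of 1,000,000. Each row is stored as the set of column indices holding a 1, which is enough to compute cosine similarity between binary rows. All names here (`primes_up_to`, `divisibility_rows`, `N_MAX`) are hypothetical.

```python
import math

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n, in increasing order."""
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, math.isqrt(n) + 1):
        if is_prime[p]:
            for m in range(p * p, n + 1, p):
                is_prime[m] = False
    return [i for i, flag in enumerate(is_prime) if flag]

def divisibility_rows(n_max):
    """Sparse rows: rows[i] is the set of j such that the j-th prime divides i."""
    primes = primes_up_to(n_max)
    rows = {i: set() for i in range(1, n_max + 1)}
    for j, p in enumerate(primes):
        for multiple in range(p, n_max + 1, p):
            rows[multiple].add(j)
    return primes, rows

def cosine_similarity(a, b):
    """Cosine similarity of two binary rows given as sets of column indices."""
    if not a or not b:
        return 0.0  # the all-zero row (integer 1) has no defined direction
    return len(a & b) / math.sqrt(len(a) * len(b))

N_MAX = 1000  # assumption: small stand-in for 1,000,000
primes, rows = divisibility_rows(N_MAX)

# Fraction of entries equal to 1 in the N_MAX x len(primes) matrix.
density = sum(len(r) for r in rows.values()) / (N_MAX * len(primes))
```

With cosine similarity, two integers are maximally similar when they share the same set of prime divisors, regardless of multiplicity: 12 and 18 both map to {2, 3}.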
We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ->inf corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]
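To make "multiplying all attractive forces by ρ" concrete, here is a minimal NumPy sketch of the standard t-SNE gradient split into its attractive and repulsive parts, with ρ scaling only the attractive one. It is a dense O(n²) illustration under stated assumptions (symmetric affinities P summing to 1, random stand-in data), not how FIt-SNE or any other library implements it; the function names are hypothetical.

```python
import numpy as np

def tsne_forces(Y, P):
    """Attractive and repulsive parts of the t-SNE gradient (dense, O(n^2))."""
    diff = Y[:, None, :] - Y[None, :, :]         # y_i - y_j, shape (n, n, d)
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))  # Cauchy kernel w_ij
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                              # low-dimensional affinities q_ij
    attractive = np.einsum("ij,ijk->ik", P * W, diff)
    repulsive = np.einsum("ij,ijk->ik", Q * W, diff)
    return attractive, repulsive

def tsne_gradient(Y, P, rho=1.0):
    """Gradient with exaggeration: 4 * sum_j (rho * p_ij - q_ij) w_ij (y_i - y_j)."""
    attractive, repulsive = tsne_forces(Y, P)
    return 4.0 * (rho * attractive - repulsive)

# Stand-in data (hypothetical): random symmetric affinities and a random layout.
rng = np.random.default_rng(0)
n = 30
A = rng.random((n, n))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()
Y = rng.normal(size=(n, 2))

grad_tsne = tsne_gradient(Y, P, rho=1.0)         # standard t-SNE
grad_exaggerated = tsne_gradient(Y, P, rho=4.0)  # stronger attraction
```

Setting `rho=1.0` recovers the usual t-SNE gradient; larger values shift the balance towards attraction.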
Here is a toy dataset with 20 Gaussians arranged on a line, like a necklace. With LE one sees the string. With t-SNE one sees the individual beads. [3/n]
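A dataset of this kind could be generated roughly as follows. Only "20 Gaussians arranged on a line" comes from the tweet; the dimensionality, the number of points per cluster, and the spacing between centres are assumptions chosen for illustration, and `make_necklace` is a hypothetical name.

```python
import numpy as np

def make_necklace(n_clusters=20, points_per_cluster=100, dim=50,
                  spacing=6.0, seed=0):
    """Isotropic unit-variance Gaussians with centres equally spaced on a line.

    All default values except n_clusters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    centers = np.zeros((n_clusters, dim))
    centers[:, 0] = spacing * np.arange(n_clusters)  # the "string" runs along axis 0
    labels = np.repeat(np.arange(n_clusters), points_per_cluster)
    X = centers[labels] + rng.normal(size=(labels.size, dim))
    return X, labels

X, labels = make_necklace()
```

The `labels` array records which bead each point belongs to, so an embedding can be coloured by position along the string.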
Spent some time investigating the history of "double descent". As a function of model complexity, I haven't seen it described before 2017. As a function of sample size, it can be traced to 1995; earlier research seems less relevant. Also: I think we need a better term. Thread. (1/n)
I don't like the term "double descent" because it has nothing to do with gradient descent. And nothing is really descending. It's all about bias-variance tradeoffs, so maybe instead of the U-shaped tradeoff one should talk about \/\-shaped? И-shaped? UL-shaped? ʯ-shaped? (3/n)
@NikolayOskolkov is the only person I saw arguing with that. Several people provided further simulations showing that UMAP with random init can mess up the global structure. I saw @leland_mcinnes agreeing that init can be important. It makes sense. (2/n)
"The art of using t-SNE for single-cell transcriptomics" by @CellTypist and myself was published two weeks ago: nature.com/articles/s4146…. This is a thread about the initialisation, the learning rate, and the exaggeration in t-SNE. I'll use MNIST to illustrate. (1/16)
FIRST, the initialisation. Most implementations of t-SNE use random initialisation: points are initially placed randomly and gradient descent then makes similar points attract each other and collect into clusters. We argue that random initialisation is often a bad idea. (2/16)
The t-SNE loss function only cares about preserving local neighbourhoods. With random initialisation, the global structure is usually not preserved, meaning that the arrangement of isolated clusters is largely arbitrary and depends mostly on the random seed. (3/16)
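An informative alternative to random initialisation is to start from the first two principal components. Below is a minimal NumPy sketch, assuming the PCA layout is rescaled so that PC1 has a small standard deviation (1e-4 here, chosen to match the scale typical of random initialisation; treat the exact value as an assumption). The function name and the synthetic stand-in data (used instead of MNIST) are hypothetical.

```python
import numpy as np

def pca_initialisation(X, n_components=2, target_std=1e-4):
    """First principal components of X, rescaled to a small overall scale."""
    Xc = X - X.mean(axis=0)                         # centre the data
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    init = U[:, :n_components] * S[:n_components]   # PCA scores
    # One common scale factor, so the PC1/PC2 aspect ratio is preserved.
    return init / np.std(init[:, 0]) * target_std

# Stand-in data (hypothetical): 500 points, 20 features of decreasing variance.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20)) * np.linspace(5.0, 1.0, 20)

init = pca_initialisation(X)
```

Unlike a random layout, this starting point does not depend on a random seed, so repeated runs begin from the same global arrangement. Many implementations accept such an array directly; scikit-learn's `TSNE`, for example, takes either `init="pca"` or an array of shape `(n_samples, n_components)`.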