PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that are also emphasized by UMAP, but fails to separate some of the digits within mega-clusters, e.g. green & red (3 and 5) overlap a lot.
IMHO that's a clearly worse performance than t-SNE or UMAP. [2/7]
Of course PHATE was designed for continuous data and that's where it's supposed to shine. But the original paper and tweets like this one and the one above make it look as if it hands-down outperforms t-SNE/UMAP for clustered data.
Here is the Tasic et al. 2018 dataset. Here again, PHATE isolates the large families (excitatory neurons, Sst+Pvalb interneurons, Lamp5+Vip interneurons, etc.) more clearly than t-SNE, but messes up the within-family structure. E.g. Vip (purple) gets wrongly entangled with Lamp5 (salmon)! [4/7]
And here is the n=1.3 million dataset: t-SNE with exaggeration 4 (which is basically UMAP) vs. PHATE. Judge for yourself.
Note that PHATE needed 11 hours (!) to run (and crashed a 20-core, 256 GB RAM machine until I used the undocumented `knn_max` param as recommended by @scottgigante). [5/7]
In comparison, t-SNE runs in something like 15 minutes. PHATE is so slow because it constructs the exact (!) kNN graph rather than an approximate one. I don't quite understand why the exact kNN graph is needed.
After the graph is constructed, PHATE uses landmarks for MDS. [6/7]
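For reference, a minimal sketch of how one can run it (the `phate` Python package; the specific `knn_max` and `n_landmark` values here are just illustrative, not a recommendation):

```python
import phate

# X: cells-by-features matrix (e.g. after PCA); a numpy array is assumed here.
# knn_max caps the neighborhood size so the kernel stays manageable on huge datasets;
# n_landmark controls how many landmarks the MDS step uses.
op = phate.PHATE(knn=5, knn_max=100, n_landmark=2000, n_jobs=-1, verbose=True)
embedding = op.fit_transform(X)   # returns an n_cells x 2 embedding
```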
Here is a summary for all three datasets.
As I said, I think the PHATE paper is interesting, and there are some nice ideas in there, and the method might very well work fine for some developmental datasets -- but I certainly cannot agree that one should "ditch" t-SNE/UMAP. [7/7]
In a new paper with @JanLause & @CellTypist we argue that the best approach for normalization of UMI counts is *analytic Pearson residuals*, using an NB model with an offset term for sequencing depth. + We analyze related 2019 papers by @satijalab and @rafalab. /1
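For concreteness, here is a minimal sketch of analytic Pearson residuals as described above: the offset model gives expected counts μ = n_cell · p_gene estimated from row/column sums, with a fixed overdispersion θ (the value 100 here is just an illustration; clipping of the residuals is omitted):

```python
import numpy as np

def pearson_residuals(X, theta=100):
    # X: cells x genes matrix of raw UMI counts (dense numpy array for simplicity)
    counts_per_cell = X.sum(axis=1, keepdims=True)   # n_cell
    counts_per_gene = X.sum(axis=0, keepdims=True)   # total counts per gene
    total = X.sum()
    # Offset NB model: expected count mu = n_cell * p_gene, with p_gene = gene total / grand total
    mu = counts_per_cell @ counts_per_gene / total
    # Pearson residuals under NB with overdispersion theta
    return (X - mu) / np.sqrt(mu + mu**2 / theta)
```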
Our project began when we looked at Fig 2 in Hafemeister & Satija 2019 (genomebiology.biomedcentral.com/articles/10.11…), who suggested using NB regression (with smoothed parameters), and wondered:
1) Why does smoothed β_0 grow linearly? 2) Why is smoothed β_1 ≈ 2.3?? 3) Why does smoothed θ grow too??? /2
The original paper does not answer any of that.
Jan figured out that: (1) is trivially true when assuming UMI ~ NB(p_gene * n_cell); (2) simply follows from the HS2019 parametrization, and the magic constant is 2.3 = ln(10); (3) is due to bias in the estimation of the overdispersion parameter θ! /3
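To spell out (2) in a couple of lines (my paraphrase of the argument): HS2019 regress log expected expression on log10 of the sequencing depth, so under the offset model

μ = p_gene · n_cell  ⇒  ln μ = ln p_gene + ln(10) · log10(n_cell),

i.e. the slope is β_1 = ln(10) ≈ 2.3 and the intercept β_0 = ln p_gene grows with the gene's mean expression.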
The input here is a 1,000,000 x 78,628 matrix X with X_ij = 1 if integer i is divisible by the j'th prime number, and 0 otherwise. So columns correspond to 2, 3, 5, 7, 11, etc. The matrix is large but very sparse: only 0.0036% of entries are 1s. We'll use cosine similarity. [3/n]
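A minimal sketch of how such a matrix can be built (using sympy.primerange and scipy.sparse; the exact prime range used for the columns is an assumption on my part):

```python
import numpy as np
from scipy import sparse
from sympy import primerange

N = 1_000_000
primes = list(primerange(2, N + 1))   # primes 2, 3, 5, 7, 11, ... = columns of the matrix

rows, cols = [], []
for j, p in enumerate(primes):
    multiples = np.arange(p, N + 1, p)      # integers i divisible by p
    rows.append(multiples - 1)              # 0-based row index for integer i
    cols.append(np.full(len(multiples), j))

rows, cols = np.concatenate(rows), np.concatenate(cols)
X = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, len(primes)))
```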
We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ->inf corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]
Here is a toy dataset with 20 Gaussians arranged on a line, like a necklace. With LE one sees the string. With t-SNE one sees the individual beads. [3/n]
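A sketch of what I mean, assuming openTSNE (its `exaggeration` parameter multiplies the attractive forces); the sizes and ρ values here are illustrative:

```python
import numpy as np
from openTSNE import TSNE

# 20 Gaussian "beads" whose means lie on a line (the "string")
rng = np.random.default_rng(42)
X = np.concatenate(
    [rng.normal(loc=[10 * k, 0], scale=1.0, size=(100, 2)) for k in range(20)]
)

# rho = 1 is standard t-SNE (individual beads visible);
# large rho approaches Laplacian eigenmaps (the string becomes visible)
embeddings = {rho: TSNE(exaggeration=rho, random_state=42).fit(X) for rho in [1, 4, 30]}
```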
Spent some time investigating the history of "double descent". As a function of model complexity, I haven't seen it described before 2017. As a function of sample size, it can be traced back to 1995; earlier research seems less relevant. Also: I think we need a better term. Thread. (1/n)
I don't like the term "double descent" because it has nothing to do with gradient descent. And nothing is really descending. It's all about bias-variance tradeoffs, so maybe instead of the U-shaped tradeoff one should talk about \/\-shaped? И-shaped? UL-shaped? ʯ-shaped? (3/n)
@NikolayOskolkov is the only person I've seen arguing against that. Several people provided further simulations showing that UMAP with random init can mess up the global structure. I saw @leland_mcinnes agree that init can be important. It makes sense. (2/n)
A year ago in Nature Biotechnology, Becht et al. argued that UMAP preserves global structure better than t-SNE. Now @GCLinderman and I wrote a comment saying that their results were entirely due to the different initialization choices: biorxiv.org/content/10.110…. Thread. (1/n)
Here is the original paper: nature.com/articles/nbt.4… by @EtienneBecht, @leland_mcinnes, @EvNewell1 et al. They used three data sets and two quantitative evaluation metrics: (1) preservation of pairwise distances and (2) reproducibility across repeated runs. UMAP won 6/6. (2/10)
UMAP and t-SNE optimize different loss functions, but the implementations used in Becht et al. also used different default initialization choices: t-SNE was initialized randomly, whereas UMAP was initialized using the Laplacian eigenmaps (LE) embedding of the kNN graph. (3/10)
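To make the role of initialization concrete, a hedged sketch using scikit-learn's TSNE and umap-learn (X is your data matrix; everything is left at defaults apart from `init` and `random_state`):

```python
from sklearn.manifold import TSNE
import umap

# The setup in Becht et al.: t-SNE initialized randomly, UMAP with spectral (LE) init
Z_tsne_random = TSNE(init="random", random_state=42).fit_transform(X)
Z_umap_le = umap.UMAP(init="spectral", random_state=42).fit_transform(X)

# Swapping the initializations: an informative (PCA) init for t-SNE,
# a random init for UMAP -- our comment argues this choice drives the difference
Z_tsne_pca = TSNE(init="pca", random_state=42).fit_transform(X)
Z_umap_random = umap.UMAP(init="random", random_state=42).fit_transform(X)
```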