PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that are also emphasized by UMAP, but fails to separate some of the digits within mega-clusters, e.g. green & red (3 and 5) overlap a lot.
IMHO that's a clearly worse performance than t-SNE or UMAP. [2/7]
Of course PHATE was designed for continuous data and that's where it's supposed to shine. But the original paper and tweets like this one and the one above make it look as if it hands-down outperforms t-SNE/UMAP for clustered data.
Here is the Tasic et al. 2018 dataset. Here again, PHATE isolates the large families (excitatory neurons, Sst+Pvalb interneurons, Lamp5+Vip interneurons, etc.) more clearly than t-SNE, but messes up the within-family structure. E.g. Vip (purple) gets wrongly entangled with Lamp5 (salmon)! [4/7]
And here is the n=1.3 million dataset: t-SNE with exaggeration 4 (which is basically UMAP) vs. PHATE. Judge for yourself.
Note that PHATE needed 11 hours (!) to run (and crashed a 20-core, 256 GB RAM machine until I used the undocumented `knn_max` param as recommended by @scottgigante). [5/7]
In comparison, t-SNE runs in something like 15 minutes. PHATE is so slow because it constructs the exact (!) kNN graph rather than an approximate one. I don't quite understand why the exact kNN graph is needed.
After the graph is constructed, PHATE uses landmarks for MDS. [6/7]
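For reference, a minimal sketch of how one can run it (the `phate` Python package; the specific `knn_max` and `n_landmark` values here are just illustrative, not a recommendation):

```python
import phate

# X: cells-by-features matrix (e.g. after PCA); a numpy array is assumed here.
# knn_max caps the neighborhood size so the kernel stays manageable on huge datasets;
# n_landmark controls how many landmarks the MDS step uses.
op = phate.PHATE(knn=5, knn_max=100, n_landmark=2000, n_jobs=-1, verbose=True)
embedding = op.fit_transform(X)   # returns an n_cells x 2 embedding
```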
Here is a summary for all three datasets.
As I said, I think the PHATE paper is interesting, and there are some nice ideas in there, and the method might very well work fine for some developmental datasets -- but I certainly cannot agree that one should "ditch" t-SNE/UMAP. [7/7]
In a new paper with @JanLause & @CellTypist we argue that the best approach for normalization of UMI counts is *analytic Pearson residuals*, using an NB model with an offset term for sequencing depth. + We analyze related 2019 papers by @satijalab and @rafalab. /1
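For concreteness, here is a minimal sketch of analytic Pearson residuals as described above: the offset model gives expected counts μ = n_cell · p_gene estimated from row/column sums, with a fixed overdispersion θ (the value 100 here is just an illustration; clipping of the residuals is omitted):

```python
import numpy as np

def pearson_residuals(X, theta=100):
    # X: cells x genes matrix of raw UMI counts (dense numpy array for simplicity)
    counts_per_cell = X.sum(axis=1, keepdims=True)   # n_cell
    counts_per_gene = X.sum(axis=0, keepdims=True)   # total counts per gene
    total = X.sum()
    # Offset NB model: expected count mu = n_cell * p_gene, with p_gene = gene total / grand total
    mu = counts_per_cell @ counts_per_gene / total
    # Pearson residuals under NB with overdispersion theta
    return (X - mu) / np.sqrt(mu + mu**2 / theta)
```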
Our project began when we looked at Fig 2 in Hafemeister & Satija 2019 (genomebiology.biomedcentral.com/articles/10.11…), who suggested using NB regression (with smoothed parameters), and wondered:
1) Why does smoothed β_0 grow linearly? 2) Why is smoothed β_1 ≈ 2.3?? 3) Why does smoothed θ grow too??? /2
The original paper does not answer any of that.
Jan figured out that: (1) is trivially true when assuming UMI ~ NB(p_gene * n_cell); (2) simply follows from the HS2019 parametrization, and the magic constant is 2.3 = ln(10); (3) is due to bias in the estimation of the overdispersion parameter θ! /3
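To spell out (2) in a couple of lines (my paraphrase of the argument): HS2019 regress log expected expression on log10 of the sequencing depth, so under the offset model

μ = p_gene · n_cell  ⇒  ln μ = ln p_gene + ln(10) · log10(n_cell),

i.e. the slope is β_1 = ln(10) ≈ 2.3 and the intercept β_0 = ln p_gene grows with the gene's mean expression.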
The input here is a 1,000,000 x 78,628 matrix X with X_ij = 1 if integer i is divisible by the j'th prime number, and 0 otherwise. So columns correspond to 2, 3, 5, 7, 11, etc. The matrix is large but very sparse: only 0.0036% of entries are 1s. We'll use cosine similarity. [3/n]
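A minimal sketch of how such a matrix can be built (using sympy.primerange and scipy.sparse; the exact prime range used for the columns is an assumption on my part):

```python
import numpy as np
from scipy import sparse
from sympy import primerange

N = 1_000_000
primes = list(primerange(2, N + 1))   # primes 2, 3, 5, 7, 11, ... = columns of the matrix

rows, cols = [], []
for j, p in enumerate(primes):
    multiples = np.arange(p, N + 1, p)      # integers i divisible by p
    rows.append(multiples - 1)              # 0-based row index for integer i
    cols.append(np.full(len(multiples), j))

rows, cols = np.concatenate(rows), np.concatenate(cols)
X = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, len(primes)))
```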
We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ->inf corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]
Here is a toy dataset with 20 Gaussians arranged on a line, like a necklace. With LE one sees the string. With t-SNE one sees the individual beads. [3/n]
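A sketch of what I mean, assuming openTSNE (its `exaggeration` parameter multiplies the attractive forces); the sizes and ρ values here are illustrative:

```python
import numpy as np
from openTSNE import TSNE

# 20 Gaussian "beads" whose means lie on a line (the "string")
rng = np.random.default_rng(42)
X = np.concatenate(
    [rng.normal(loc=[10 * k, 0], scale=1.0, size=(100, 2)) for k in range(20)]
)

# rho = 1 is standard t-SNE (individual beads visible);
# large rho approaches Laplacian eigenmaps (the string becomes visible)
embeddings = {rho: TSNE(exaggeration=rho, random_state=42).fit(X) for rho in [1, 4, 30]}
```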
Spent some time investigating the history of "double descent". As a function of model complexity, I haven't seen it described before 2017. As a function of sample size, it can be traced back to 1995; earlier research seems less relevant. Also: I think we need a better term. Thread. (1/n)
I don't like the term "double descent" because it has nothing to do with gradient descent. And nothing is really descending. It's all about bias-variance tradeoffs, so maybe instead of the U-shaped tradeoff one should talk about \/\-shaped? И-shaped? UL-shaped? ʯ-shaped? (3/n)
@NikolayOskolkov is the only person I've seen arguing against that. Several people provided further simulations showing that UMAP with random init can mess up the global structure. I saw @leland_mcinnes agree that init can be important. It makes sense. (2/n)
A year ago in Nature Biotechnology, Becht et al. argued that UMAP preserves global structure better than t-SNE. Now @GCLinderman and I wrote a comment saying that their results were entirely due to the different initialization choices: biorxiv.org/content/10.110…. Thread. (1/n)
Here is the original paper: nature.com/articles/nbt.4… by @EtienneBecht, @leland_mcinnes, @EvNewell1 et al. They used three data sets and two quantitative evaluation metrics: (1) preservation of pairwise distances and (2) reproducibility across repeated runs. UMAP won 6/6. (2/10)
UMAP and t-SNE optimize different loss functions, but the implementations used in Becht et al. also used different default initialization choices: t-SNE was initialized randomly, whereas UMAP was initialized using the Laplacian eigenmaps (LE) embedding of the kNN graph. (3/10)
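To make the role of initialization concrete, a hedged sketch using scikit-learn's TSNE and umap-learn (X is your data matrix; everything is left at defaults apart from `init` and `random_state`):

```python
from sklearn.manifold import TSNE
import umap

# The setup in Becht et al.: t-SNE initialized randomly, UMAP with spectral (LE) init
Z_tsne_random = TSNE(init="random", random_state=42).fit_transform(X)
Z_umap_le = umap.UMAP(init="spectral", random_state=42).fit_transform(X)

# Swapping the initializations: an informative (PCA) init for t-SNE,
# a random init for UMAP -- our comment argues this choice drives the difference
Z_tsne_pca = TSNE(init="pca", random_state=42).fit_transform(X)
Z_umap_random = umap.UMAP(init="random", random_state=42).fit_transform(X)
```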