Tweet

Dmitry Kobak

21 Oct, 11 tweets, 6 min read

@jhnhw

Remember the galaxy-like UMAP visualization of integers from 1 to 1,000,000 represented as prime factors, made by @jhnhw?

I did t-SNE of the same data, and figured out what the individual blobs are. Turns out, the swirly and spaghetti UMAP structures were artifacts :-(

[1/n]

@jhnhw

Here is the original tweet by @jhnhw. His write-up: johnhw.github.io/umap_primes/in…. UMAP preprint v2 by @leland_mcinnes et al. has a figure with 30,000,000 (!) integers.

But what are all the swirls and spaghetti?

Unexplained mystery since 2008. CC @ch402. [2/n]

https://twitter.com/jhnhw/status/1031829726757900288

The input here is a 1,000,000 x 78,628 matrix X with X_ij = 1 if integer i is divisible by the j'th prime number, and 0 otherwise. So columns correspond to 2, 3, 5, 7, 11, etc. The matrix is large but very sparse: only 0.0036% of entries are 1s. We'll use cosine similarity. [3/n]

@pavlinpolicar

I use openTSNE by @pavlinpolicar. It uses Pynndescent by @leland_mcinnes to construct the kNN graph for sparse inputs. I'm using uniform affinities with k=15.

NB: t-SNE is faster than UMAP and needs less memory. I could run t-SNE *but not UMAP* on my 16Gb RAM laptop. [4/n]

First of all -- the current version of UMAP does not produce any swirls/spaghetti 😧 I was getting them back in February (with UMAP 0.2 or 0.3), but with UMAP 0.4 I only get blobs and some doughnuts. This suggests that swirls/spaghetti were convergence artifacts. [5/n]

t-SNE only shows blobs (and some stardust). Can we understand what they are?

Yes We Can!

The max number of prime factors (max row sum) in this dataset is 7. Coloring the embeddings by the number of prime factors shows that each blob has the same number of them. [6/n]

@wtgowers

Integers with two prime factors (orange) make up multiple blobs. Turns out, the largest blob consists of all such numbers that are divisible by 2. The next one --- of all such numbers divisible by 5. The next one -- by 7, etc. Here are labels up to 19. CC @wtgowers. [7/n]

Similarly, numbers with three prime factors (green) are grouped into blobs by a combination of two shared prime factors. Etc.

This makes total sense if one thinks about how cosine similarity works. All numbers in one blob are equidistant from each other. [8/n]

So my guess is that the doughnuts in UMAP 0.4 are optimization artifacts of some kind: blobs do not really have any internal structure, as correctly shown by t-SNE. Why they appear in UMAP, I don't know.

For completeness, here are both UMAP versions labeled this way. [9/n]

This was a lot of fun to play around with. And led to several improvements in openTSNE along the way.

The main lesson, I guess, is that convergence artifacts can be very pretty ;-)

[10/10]

And now an animation! Gradually showing all integers from 1 to 1,000,000 in 50 steps of 20,000. [11/10]

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @hippopedoid

Dmitry Kobak

@hippopedoid

20 Jul

@jnboehm

New preprint on attraction-repulsion spectrum in t-SNE => continuity-discreteness trade-off!

We also show that UMAP has higher attraction due to negative sampling, and not due to its loss. 🤯 Plus we demystify FA2.

With @jnboehm and @CellTypist.
arxiv.org/abs/2007.08902 [1/n]

@GCLinderman

We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ->inf corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]

Stronger attraction preserves continuous manifold structure. Stronger repulsion brings out discrete cluster structure.

Here is a toy dataset with 20 Gaussians arranged on a line, like a necklace. With LE one sees the string. With t-SNE one sees the individual beads. [3/n]

Read 10 tweets

Dmitry Kobak

@hippopedoid

26 Mar

Spent some time investigating history of "double descent". As a function of model complexity, I haven't seen it described before 2017. As a function of sample size, it can be traced to 1995; earlier research seems less relevant. Also: I think we need a better term. Thread. (1/n)

The term "double descent" was coined by Belkin et al 2019 pnas.org/content/116/32… but the same phenomenon was also described in two earlier preprints: Spigler et al 2019 iopscience.iop.org/article/10.108… and Advani & Saxe 2017 arxiv.org/abs/1710.03667 (still unpublished?) (2/n)

I don't like the term "double descent" because it has nothing to do with gradient descent. And nothing is really descending. It's all about bias-variance tradeoffs, so maybe instead of the U-shaped tradeoff one should talk about \/\-shaped? И-shaped? UL-shaped? ʯ-shaped? (3/n)

Read 13 tweets

Dmitry Kobak

@hippopedoid

12 Feb

@GCLinderman

Becht et al.: UMAP preserves global structure better than t-SNE.

@GCLinderman & me: only because you used random init for t-SNE but spectral init for UMAP.

@NikolayOskolkov: that's wrong; init does not matter; the loss function does.

This thread is a response to Nikolay. (1/n)

https://twitter.com/hippopedoid/status/1207999178015727616

@NikolayOskolkov

@NikolayOskolkov is the only person I saw arguing with that. Several people provided further simulations showing that UMAP with random init can mess up the global structure. I saw @leland_mcinnes agreeing that init can be important. It makes sense. (2/n)

https://twitter.com/leland_mcinnes/status/1215025214674878474

@NikolayOskolkov

But @NikolayOskolkov argued against. Here is his popular UMAP write-up: towardsdatascience.com/how-exactly-um…, and here: towardsdatascience.com/why-umap-is-su… he explicitly disagreed with our Comment. I think his UMAP posts are great and I like them a lot, but in this point I believe he is mistaken. (3/n)

Read 12 tweets

Dmitry Kobak

@hippopedoid

16 Dec 19

@CellTypist

"The art of using t-SNE for single-cell transcriptomics" by @CellTypist and myself was published two weeks ago: nature.com/articles/s4146…. This is a thread about the initialisation, the learning rate, and the exaggeration in t-SNE. I'll use MNIST to illustrate. (1/16)

FIRST, the initialisation. Most implementations of t-SNE use random initialisation: points are initially placed randomly and gradient descent then makes similar points attract each other and collect into clusters. We argue that random initialisation is often a bad idea (2/16).

The t-SNE loss function only cares about preserving local neighbourhoods. With random initialisation, the global structure if usually not preserved, meaning that the arrangement of isolated clusters is largely arbitrary and depends mostly on the random seed. (3/16)

Read 16 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Share this page!

Dmitry Kobak

Try unrolling a thread yourself!

More from @hippopedoid

Dmitry Kobak

Dmitry Kobak

Dmitry Kobak

Dmitry Kobak

Did Thread Reader help you today?

Like this author's thread?