Super excited to announce that our paper “Clustering with UMAP: Why and How Connectivity Matters” has been accepted for a presentation at the @GclrW (@AAAI 2022) arxiv.org/abs/2108.05525) #AAAI2022
1/10
What makes a good topological structure for dimensionality reduction? What started off as @suzyahyah and me trying to visualize high dim datasets ended in us finding an improvement to the topologies used in the graph-based UMAP (McInnes, Healy, Melville): a mutual kNN graph!
2/10
In UMAP, a k-NN graph is used to generate the initial topological representation of a dataset. However, previous works has shown that a kNN graph can capture noisy links and may not be an accurate representation for a high dimensional dataset.
3/10
A mutual kNN representation has been shown to combat associated issues, namely distance concentration and hub effect. However, trying to use the UMAP with a mutual k-NN graph isn’t as clear cut since a mutual k-NN graph can contain isolated vertices and low connectivity.
4/10
To combat the issue of isolated components, we consider different methods to augment the mutual k-NN graph such as connecting isolated vertices to their nearest neighbors and including edges from the minimum spanning tree, as well as Shortest Path distances(Path Neighbors).
5/10
We observe that using a mutual k-NN representation augmented with a MST variant and Path Neighbors as the starting topology for UMAP consistently produces clearer separation between classes and preserves the “global structure” for the datasets we tested.
6/10
As shown below, for the FMNIST dataset, the vectors using the aforementioned method preserved the global structure between clothing classes from footwear classes while also depicting a clearer separation between the footwear classes.
7/10
We explore these concepts through extensive ablation studies on 4 standard image and text datasets; MNIST, FMNIST, 20NG, AG, reducing to 2 and 64 dimensions.
8/10
Our findings indicate that a more refined notion of connectivity (mutual k-NN with MST) together with a flexible method of constructing the local neighborhood (Path Neigh), can achieve a better representation as measured by downstream clustering performance.
9/10
We hope this paper serves as a reminder how important it is to consider the initial topological layout of the dataset before applying algorithms like UMAP or t-SNE.

Looking forward to seeing all the other great papers at AAAI!
10/10

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Ayush Dalmia

Ayush Dalmia Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us on Twitter!

:(