Super excited to announce that our paper “Clustering with UMAP: Why and How Connectivity Matters” has been accepted for a presentation at the @GclrW (@AAAI 2022) arxiv.org/abs/2108.05525) #AAAI2022
1/10
What makes a good topological structure for dimensionality reduction? What started off as @suzyahyah and me trying to visualize high dim datasets ended in us finding an improvement to the topologies used in the graph-based UMAP (McInnes, Healy, Melville): a mutual kNN graph!
2/10
In UMAP, a k-NN graph is used to generate the initial topological representation of a dataset. However, previous works has shown that a kNN graph can capture noisy links and may not be an accurate representation for a high dimensional dataset.
3/10
A mutual kNN representation has been shown to combat associated issues, namely distance concentration and hub effect. However, trying to use the UMAP with a mutual k-NN graph isn’t as clear cut since a mutual k-NN graph can contain isolated vertices and low connectivity.
4/10
To combat the issue of isolated components, we consider different methods to augment the mutual k-NN graph such as connecting isolated vertices to their nearest neighbors and including edges from the minimum spanning tree, as well as Shortest Path distances(Path Neighbors).
5/10
We observe that using a mutual k-NN representation augmented with a MST variant and Path Neighbors as the starting topology for UMAP consistently produces clearer separation between classes and preserves the “global structure” for the datasets we tested.
6/10
As shown below, for the FMNIST dataset, the vectors using the aforementioned method preserved the global structure between clothing classes from footwear classes while also depicting a clearer separation between the footwear classes.
7/10
We explore these concepts through extensive ablation studies on 4 standard image and text datasets; MNIST, FMNIST, 20NG, AG, reducing to 2 and 64 dimensions.
8/10
Our findings indicate that a more refined notion of connectivity (mutual k-NN with MST) together with a flexible method of constructing the local neighborhood (Path Neigh), can achieve a better representation as measured by downstream clustering performance.
9/10
We hope this paper serves as a reminder how important it is to consider the initial topological layout of the dataset before applying algorithms like UMAP or t-SNE.
Looking forward to seeing all the other great papers at AAAI!
10/10
• • •
Missing some Tweet in this thread? You can try to
force a refresh