PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that UMAP also emphasizes, but fails to separate some of the digits within the mega-clusters, e.g. the green and red clusters (3 and 5) overlap substantially.
IMHO that's clearly worse performance than t-SNE or UMAP. [2/7]
Of course PHATE was designed for continuous data and that's where it's supposed to shine. But the original paper and tweets like this one and the one above make it look as if it hands-down outperforms t-SNE/UMAP for clustered data.
Here is the Tasic et al. 2018 dataset. Here again, PHATE isolates the large families (excitatory neurons, Sst+Pvalb interneurons, Lamp5+Vip interneurons, etc.) more clearly than t-SNE, but messes up the within-family structure. E.g. Vip (purple) gets wrongly entangled with Lamp5 (salmon)! [4/7]
And here is the n = 1.3 million dataset: t-SNE with exaggeration 4 (which is basically UMAP) vs. PHATE. Judge for yourself.
Note that PHATE needed 11 hours (!) to run, and kept crashing a 20-core, 256 GB RAM machine until I used the undocumented `knn_max` parameter, as recommended by @scottgigante. [5/7]
In comparison, t-SNE runs in about 15 minutes. PHATE is slow because it constructs the exact (!) kNN graph rather than an approximate one. I don't quite understand why they need the exact kNN.
After the graph is constructed, PHATE uses landmarks for MDS. [6/7]
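To see why the exact kNN graph is the bottleneck, here is a minimal brute-force kNN sketch in NumPy (my own illustration, not PHATE's actual code): it materializes all pairwise distances, which is quadratic in n, whereas the approximate kNN used by common t-SNE/UMAP implementations avoids that blow-up.

```python
import numpy as np

def exact_knn(X, k):
    """Brute-force exact kNN: materializes ALL pairwise squared
    distances -- O(n^2) time and memory, which is why exact kNN
    is hopeless at n = 1.3M (~1.7e12 pairs)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # (n, n)
    np.fill_diagonal(d2, np.inf)        # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

# Two tight pairs: each point's nearest neighbor is its partner.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
assert exact_knn(X, 1).ravel().tolist() == [1, 0, 3, 2]
```

At n = 1.3 million this means ~1.7 × 10¹² distance computations, which is exactly the regime where approximate neighbor search (Annoy, NN-descent) pays off.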
Here is a summary for all three datasets.
As I said, I think the PHATE paper is interesting, and there are some nice ideas in there, and the method might very well work fine for some developmental datasets -- but I certainly cannot agree that one should "ditch" t-SNE/UMAP. [7/7]
How many academic papers are written with the help of ChatGPT? To answer this question, we analyzed 14 million PubMed abstracts from 2010 to 2024 and looked for excess words:
** Delving into ChatGPT usage in academic writing through excess vocabulary **
Really excited to present new work by @ritagonmar: we visualized the entire PubMed library, 21 million biomedical and life science papers, and learned a lot about --
We took all (21M) English abstracts from PubMed, used a BERT model (PubMedBERT) to transform them into 768D vectors, and then used t-SNE to visualize them in 2D.
We used the 2D map to explore the library, and confirmed each insight in 768D.
We focus on four insights. 2/n
Case study #1: Covid-19 literature.
When looking at the t-SNE map colored by publication year (yellow = newer papers), we immediately see a bright yellow cluster. A large cluster of related papers, all published in 2020-21. What could it be? 🤔
We held a reading group on Transformers (watched videos / read blog posts / studied papers by @giffmana, @karpathy, @ch402, @amaarora, @JayAlammar, @srush_nlp et al.), and now I _finally_ roughly understand what attention does.
Here is my take on it. A summary thread. 1/n
Consider the BERT/GPT setting.
We have a text string, split into tokens (<=512). Each token gets a 768-dim vector. So we have a 2D matrix X whose one dimension is fixed (768) while the other (the number of tokens) varies from input to input. We want to set up a feed-forward layer that would somehow transform X, keeping its shape.
How can this be set up? 2/n
A fully-connected layer does not work: it cannot take input of variable length (and would have far too many parameters anyway).
Only acting on the embedding dimension would process each token separately, which is clearly not sufficient.
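Attention resolves exactly this tension: a fixed set of weights, any number of tokens, and tokens interact with each other. A minimal single-head self-attention sketch in NumPy (random weights, dimensions from the thread; purely illustrative, not any particular library's implementation):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: the output has the same shape as X
    for ANY number of tokens -- the weights never depend on length."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n_tokens, d)
    scores = Q @ K.T / np.sqrt(Wk.shape[1])     # (n_tokens, n_tokens)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # row-wise softmax
    return w @ V                                # (n_tokens, d)

d = 768
rng = np.random.default_rng(0)
Wq, Wk, Wv = (0.01 * rng.standard_normal((d, d)) for _ in range(3))

for n_tokens in (5, 100):                       # variable-length inputs
    X = rng.standard_normal((n_tokens, d))
    assert self_attention(X, Wq, Wk, Wv).shape == (n_tokens, d)
```

Note that the weight matrices are (768 × 768) regardless of sequence length; only the intermediate (n_tokens × n_tokens) attention matrix depends on the input size.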
I think we have finally understood the *real* difference between t-SNE and UMAP. It involves NCE! [1/n]
In prior work, we (@jnboehm, @CellTypist) showed that UMAP works like t-SNE with extra attraction. We argued that this is because UMAP relies on negative sampling, whereas t-SNE does not.
Because UMAP uses negative sampling, its effective loss function is very different from its stated loss function (cross-entropy). @jnboehm showed it via Barnes-Hut UMAP, while Sebastian and Fred did mathematical analysis in their NeurIPS 2021 paper proceedings.neurips.cc/paper/2021/has… [3/n]
My paper on Poisson underdispersion in reported Covid-19 cases & deaths is out in @signmagazine. The claim is that underdispersion is a HUGE RED FLAG and suggests misreporting.
What is "underdispersion"? Here is an example. Russia reported the following number of Covid deaths during the first week of September 2021: 792, 795, 790, 798, 799, 796, 793.
Mean: 795. Variance: 11. For Poisson random data, mean=variance. So this is *underdispersed*. /2
For comparison, during the same week the US reported 1461, 1185, 1202, 1795, 2010, 2003, 1942 deaths. Mean: 1657. Variance: 135470. So this is *overdispersed*.
Overdispersion is not surprising: day-of-week reporting fluctuations, epidemic growth, etc.
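The dispersion check above is a one-liner; here is a quick sketch with Python's statistics module (death counts taken from the thread):

```python
from statistics import mean, variance

def dispersion(counts):
    """Sample mean and variance. For Poisson-distributed counts,
    mean ~ variance; variance far below the mean is underdispersion."""
    return mean(counts), variance(counts)

russia = [792, 795, 790, 798, 799, 796, 793]       # first week of Sep 2021
usa    = [1461, 1185, 1202, 1795, 2010, 2003, 1942]

m_ru, v_ru = dispersion(russia)
m_us, v_us = dispersion(usa)
assert v_ru < m_ru   # 11 << 795: underdispersed
assert v_us > m_us   # 135470 >> 1657: overdispersed
```

Variance far below the mean (11 vs. 795) is the red flag; variance far above it (135470 vs. 1657) is what real-world reporting normally produces.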
So what's up with the Russian election two weeks ago? Was there fraud?
Of course there was fraud. Widespread ballot stuffing was videotaped etc., but we can also prove fraud using statistics.
See these *integer peaks* in the histograms of the polling station results? 🕵️♂️ [1/n]
These peaks are formed by polling stations that report an integer turnout percentage or an integer United Russia percentage. E.g. 1492 ballots cast at a station with 1755 registered voters: 1492/1755 = 85.0%. Important: 1492 is not a suspicious number by itself! It's the 85.0% that is suspicious. [2/n]
We can use binomial Monte Carlo simulation to find how many polling stations with integer percentages there should be by chance. Then we can compute the number of EXCESS integer polling stations (roughly the summed heights of all INTEGER PEAKS).
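Here is a minimal version of that simulation (the station size, true turnout, and simulation count are made-up illustrative values; only the 1492/1755 example comes from the thread):

```python
import random

# The thread's example: 1492 ballots out of 1755 registered voters
# gives exactly 85.0% -- an "integer peak" candidate.
assert round(100 * 1492 / 1755, 1) == 85.0

def integer_pct_rate(voters=1755, turnout=0.85, n_sim=500):
    """Monte Carlo estimate of how often a station's binomial turnout,
    rounded to one decimal place, lands exactly on an integer percent.
    With one-decimal reporting, roughly 1 in 10 stations do so by chance."""
    hits = 0
    for _ in range(n_sim):
        ballots = sum(random.random() < turnout for _ in range(voters))
        pct = round(100 * ballots / voters, 1)
        hits += (pct == int(pct))
    return hits / n_sim

random.seed(0)
rate = integer_pct_rate()
assert 0.03 < rate < 0.20   # close to the ~0.1 chance level
```

Subtracting this expected ~10% chance rate from the observed share of integer-percentage stations gives the excess count, i.e. roughly the summed height of the integer peaks.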