PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that UMAP also emphasizes, but fails to separate some of the digits within the mega-clusters, e.g. the green and red clusters (3 and 5) overlap substantially.
IMHO that's clearly worse performance than t-SNE or UMAP. [2/7]
Of course PHATE was designed for continuous data and that's where it's supposed to shine. But the original paper and tweets like this one and the one above make it look as if it hands-down outperforms t-SNE/UMAP for clustered data.
Here is the Tasic et al. 2018 dataset. Here again, PHATE isolates the large families (excitatory neurons, Sst+Pvalb interneurons, Lamp5+Vip interneurons, etc.) more clearly than t-SNE, but messes up the within-family structure. E.g. Vip (purple) gets wrongly entangled with Lamp5 (salmon)! [4/7]
And here is the n = 1.3 million dataset: t-SNE with exaggeration 4 (which is basically UMAP) vs. PHATE. Judge for yourself.
Note that PHATE needed 11 hours (!) to run, and kept crashing a 20-core, 256 GB RAM machine until I used the undocumented `knn_max` parameter, as recommended by @scottgigante. [5/7]
In comparison, t-SNE runs in about 15 minutes. PHATE is slow because it constructs the exact (!) kNN graph rather than an approximate one. I don't quite understand why they need the exact kNN.
After the graph is constructed, PHATE uses landmarks for MDS. [6/7]
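To see why the exact kNN graph is the bottleneck, here is a minimal brute-force kNN sketch in NumPy (my own illustration, not PHATE's actual code): it materializes all pairwise distances, which is quadratic in n, whereas the approximate kNN used by common t-SNE/UMAP implementations avoids that blow-up.

```python
import numpy as np

def exact_knn(X, k):
    """Brute-force exact kNN: materializes ALL pairwise squared
    distances -- O(n^2) time and memory, which is why exact kNN
    is hopeless at n = 1.3M (~1.7e12 pairs)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # (n, n)
    np.fill_diagonal(d2, np.inf)        # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

# Two tight pairs: each point's nearest neighbor is its partner.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
assert exact_knn(X, 1).ravel().tolist() == [1, 0, 3, 2]
```

At n = 1.3 million this means ~1.7 × 10¹² distance computations, which is exactly the regime where approximate neighbor search (Annoy, NN-descent) pays off.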
Here is a summary for all three datasets.
As I said, I think the PHATE paper is interesting, and there are some nice ideas in there, and the method might very well work fine for some developmental datasets -- but I certainly cannot agree that one should "ditch" t-SNE/UMAP. [7/7]
How many academic papers are written with the help of ChatGPT? To answer this question, we analyzed 14 million PubMed abstracts from 2010 to 2024 and looked for excess words:
** Delving into ChatGPT usage in academic writing through excess vocabulary **
Really excited to present new work by @ritagonmar: we visualized the entire PubMed library, 21 million biomedical and life science papers, and learned a lot about --
We took all (21M) English abstracts from PubMed, used a BERT model (PubMedBERT) to transform them into 768D vectors, and then used t-SNE to visualize them in 2D.
We used the 2D map to explore the library, and confirmed each insight in 768D.
We focus on four insights. 2/n
Case study #1: Covid-19 literature.
When looking at the t-SNE map colored by publication year (yellow = newer papers), we immediately see a bright yellow cluster. A large cluster of related papers, all published in 2020-21. What could it be? 🤔
We held a reading group on Transformers (watched videos / read blog posts / studied papers by @giffmana, @karpathy, @ch402, @amaarora, @JayAlammar, @srush_nlp et al.), and now I _finally_ roughly understand what attention does.
Here is my take on it. A summary thread. 1/n
Consider the BERT/GPT setting.
We have a text string, split into tokens (<=512). Each token gets a 768-dim vector. So we have a 2D matrix X whose one dimension is fixed (768) while the other (the number of tokens) varies from input to input. We want to set up a feed-forward layer that would somehow transform X, keeping its shape.
How can this be set up? 2/n
A fully-connected layer does not work: it cannot take input of variable length (and would have far too many parameters anyway).
Only acting on the embedding dimension would process each token separately, which is clearly not sufficient.
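Attention resolves exactly this tension: a fixed set of weights, any number of tokens, and tokens interact with each other. A minimal single-head self-attention sketch in NumPy (random weights, dimensions from the thread; purely illustrative, not any particular library's implementation):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: the output has the same shape as X
    for ANY number of tokens -- the weights never depend on length."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n_tokens, d)
    scores = Q @ K.T / np.sqrt(Wk.shape[1])     # (n_tokens, n_tokens)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # row-wise softmax
    return w @ V                                # (n_tokens, d)

d = 768
rng = np.random.default_rng(0)
Wq, Wk, Wv = (0.01 * rng.standard_normal((d, d)) for _ in range(3))

for n_tokens in (5, 100):                       # variable-length inputs
    X = rng.standard_normal((n_tokens, d))
    assert self_attention(X, Wq, Wk, Wv).shape == (n_tokens, d)
```

Note that the weight matrices are (768 × 768) regardless of sequence length; only the intermediate (n_tokens × n_tokens) attention matrix depends on the input size.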
I think we have finally understood the *real* difference between t-SNE and UMAP. It involves NCE! [1/n]
In prior work, we (@jnboehm, @CellTypist) showed that UMAP works like t-SNE with extra attraction. We argued that this is because UMAP relies on negative sampling, whereas t-SNE does not.
Because UMAP uses negative sampling, its effective loss function is very different from its stated loss function (cross-entropy). @jnboehm showed it via Barnes-Hut UMAP, while Sebastian and Fred did mathematical analysis in their NeurIPS 2021 paper proceedings.neurips.cc/paper/2021/has… [3/n]
My paper on Poisson underdispersion in reported Covid-19 cases & deaths is out in @signmagazine. The claim is that underdispersion is a HUGE RED FLAG and suggests misreporting.
What is "underdispersion"? Here is an example. Russia reported the following number of Covid deaths during the first week of September 2021: 792, 795, 790, 798, 799, 796, 793.
Mean: 795. Variance: 11. For Poisson random data, mean=variance. So this is *underdispersed*. /2
For comparison, during the same week the US reported 1461, 1185, 1202, 1795, 2010, 2003, 1942 deaths. Mean: 1657. Variance: 135470. So this is *overdispersed*.
Overdispersion is not surprising: day-of-week reporting fluctuations, epidemic growth, etc.
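The dispersion check above is a one-liner; here is a quick sketch with Python's statistics module (death counts taken from the thread):

```python
from statistics import mean, variance

def dispersion(counts):
    """Sample mean and variance. For Poisson-distributed counts,
    mean ~ variance; variance far below the mean is underdispersion."""
    return mean(counts), variance(counts)

russia = [792, 795, 790, 798, 799, 796, 793]       # first week of Sep 2021
usa    = [1461, 1185, 1202, 1795, 2010, 2003, 1942]

m_ru, v_ru = dispersion(russia)
m_us, v_us = dispersion(usa)
assert v_ru < m_ru   # 11 << 795: underdispersed
assert v_us > m_us   # 135470 >> 1657: overdispersed
```

Variance far below the mean (11 vs. 795) is the red flag; variance far above it (135470 vs. 1657) is what real-world reporting normally produces.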
So what's up with the Russian election two weeks ago? Was there fraud?
Of course there was fraud. Widespread ballot stuffing was videotaped etc., but we can also prove fraud using statistics.
See these *integer peaks* in the histograms of the polling station results? 🕵️♂️ [1/n]
These peaks are formed by polling stations that report an integer turnout percentage or an integer United Russia percentage. E.g. 1492 ballots cast at a station with 1755 registered voters: 1492/1755 = 85.0%. Important: 1492 is not a suspicious number by itself! It's the 85.0% that is suspicious. [2/n]
We can use binomial Monte Carlo simulation to find how many polling stations with integer percentages there should be by chance. Then we can compute the number of EXCESS integer polling stations (roughly the summed heights of all INTEGER PEAKS).
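Here is a minimal version of that simulation (the station size, true turnout, and simulation count are made-up illustrative values; only the 1492/1755 example comes from the thread):

```python
import random

# The thread's example: 1492 ballots out of 1755 registered voters
# gives exactly 85.0% -- an "integer peak" candidate.
assert round(100 * 1492 / 1755, 1) == 85.0

def integer_pct_rate(voters=1755, turnout=0.85, n_sim=500):
    """Monte Carlo estimate of how often a station's binomial turnout,
    rounded to one decimal place, lands exactly on an integer percent.
    With one-decimal reporting, roughly 1 in 10 stations do so by chance."""
    hits = 0
    for _ in range(n_sim):
        ballots = sum(random.random() < turnout for _ in range(voters))
        pct = round(100 * ballots / voters, 1)
        hits += (pct == int(pct))
    return hits / n_sim

random.seed(0)
rate = integer_pct_rate()
assert 0.03 < rate < 0.20   # close to the ~0.1 chance level
```

Subtracting this expected ~10% chance rate from the observed share of integer-percentage stations gives the excess count, i.e. roughly the summed height of the integer peaks.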