Dmitry Kobak
Researcher at @uni_tue. Manifold learning, contrastive learning, scRNAseq data. Excess mortality. Born but to die and reas'ning but to err.
Jun 21
How many academic papers are written with the help of ChatGPT? To answer this question, we analyzed 14 million PubMed abstracts from 2010 to 2024 and looked for excess words:

** Delving into ChatGPT usage in academic writing through excess vocabulary **

1/11 arxiv.org/abs/2406.07016
Many have noticed that ChatGPT likes the word "delve". Or "crucial". Or "intricate".

We checked ALL WORDS from ALL PubMed abstracts to find words with a sudden increase in popularity in 2023-24.

For comparison, we did the same for ALL YEARS from 2010 onward.
2/11
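To make the idea concrete, here is a minimal sketch of counting per-year word frequencies and ranking words by their excess usage (not the paper's exact pipeline); the `abstracts` dict mapping year to a list of abstract strings is a placeholder.

```python
# Minimal sketch of the excess-vocabulary idea (not the paper's exact pipeline).
# Assumes `abstracts` is a dict mapping year -> list of abstract strings.
import re
from collections import Counter

def word_frequencies(texts):
    """Fraction of abstracts in which each lowercased word occurs."""
    counts = Counter()
    for text in texts:
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    n = len(texts)
    return {w: c / n for w, c in counts.items()}

freq_2022 = word_frequencies(abstracts[2022])   # pre-ChatGPT baseline
freq_2024 = word_frequencies(abstracts[2024])   # post-ChatGPT

# Words with the largest frequency ratio (excess usage) in 2024 vs 2022
excess = {w: freq_2024[w] / freq_2022[w]
          for w in freq_2024 if freq_2022.get(w, 0) > 1e-4}
for w, ratio in sorted(excess.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{w:15s} {ratio:.1f}x")
```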
Apr 13, 2023
Really excited to present new work by @ritagonmar: we visualized the entire PubMed library, 21 million biomedical and life science papers, and learned a lot about --

THE LANDSCAPE OF BIOMEDICAL RESEARCH
biorxiv.org/content/10.110…

Joint work with @CellTypist and @benmschmidt. 1/n

We took all (21M) English abstracts from PubMed, used a BERT model (PubMedBERT) to transform them into 768D vectors, and then used t-SNE to visualize them in 2D.

We used the 2D map to explore the library, and confirmed each insight in 768D.

We focus on four insights. 2/n
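Roughly, the pipeline looks like this. The checkpoint name and the mean-pooling choice are assumptions for illustration, not necessarily the exact setup used in the paper.

```python
# Sketch of the embedding pipeline under stated assumptions; checkpoint name
# and mean pooling are illustrative, not necessarily the paper's exact choices.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from openTSNE import TSNE

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

@torch.no_grad()
def embed(texts, batch_size=32):
    """Return one 768-dim vector per abstract (mean over token embeddings)."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=512, return_tensors="pt")
        hidden = model(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        vectors.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(vectors)

X = embed(abstracts)                       # abstracts: list of strings (placeholder)
Y = TSNE(n_jobs=-1, verbose=True).fit(X)   # 2D map of the corpus
```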
Mar 30, 2023
We held a reading group on Transformers (watched videos / read blog posts / studied papers by @giffmana @karpathy @ch402 @amaarora @JayAlammar @srush_nlp et al.), and now I _finally_ roughly understand what attention does.

Here is my take on it. A summary thread. 1/n

Consider the BERT/GPT setting.

We have a text string, split into tokens (<=512). Each token gets a 768-dim vector, so we have a 2D matrix X that is 768-dim along one axis and of arbitrary size along the token axis. We want to set up a feed-forward layer that transforms X while keeping its shape.

How can this be set up? 2/n
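A minimal NumPy sketch of single-head self-attention, just to show that it maps an (n_tokens, 768) matrix to another matrix of the same shape, whatever n_tokens is.

```python
# Minimal single-head self-attention sketch in NumPy: it maps an (n_tokens, d)
# matrix X to another (n_tokens, d) matrix, for any number of tokens.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d). Wq, Wk, Wv: (d, d). Returns (n_tokens, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (n_tokens, n_tokens) attention weights
    return A @ V                                 # each token: weighted mix of value vectors

rng = np.random.default_rng(0)
d, n = 768, 10
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (10, 768): same shape as X
```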
Dec 2, 2022
A very long overdue thread: happy to share a preprint led by Sebastian Damrich from @FredHamprecht's lab.

*From t-SNE to UMAP with contrastive learning*
arxiv.org/abs/2206.01816

I think we have finally understood the *real* difference between t-SNE and UMAP. It involves NCE! [1/n]

In prior work, we (@jnboehm @CellTypist) showed that UMAP works like t-SNE with extra attraction. We argued that it is because UMAP relies on negative sampling, whereas t-SNE does not.

Turns out, this was not the whole story. [2/n]
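For intuition, here is a toy sketch of the per-edge loss that negative sampling optimizes in UMAP-style embeddings (binary cross-entropy over one positive pair and m sampled negatives); the similarity kernel and constants are simplified, and this is not the paper's derivation.

```python
# Toy sketch of a negative-sampling loss for one edge of the kNN graph:
# attract the positive pair, repel the anchor from m random "noise" points.
# Simplified low-dim similarity q = 1 / (1 + d^2).
import numpy as np

def q(yi, yj):
    d2 = np.sum((yi - yj) ** 2)
    return 1.0 / (1.0 + d2)

def negative_sampling_loss(Y, i, j, m, rng):
    """Binary cross-entropy: positive pair (i, j) vs m randomly sampled negatives."""
    loss = -np.log(q(Y[i], Y[j]) + 1e-12)                 # attraction term
    for k in rng.integers(0, len(Y), size=m):             # m noise samples
        loss -= np.log(1.0 - q(Y[i], Y[k]) + 1e-12)       # repulsion terms
    return loss

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 2))
print(negative_sampling_loss(Y, 0, 1, m=5, rng=rng))
```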
Apr 26, 2022
My paper on Poisson underdispersion in reported Covid-19 cases & deaths is out in @signmagazine. The claim is that underdispersion is a HUGE RED FLAG and suggests misreporting.

Paper: rss.onlinelibrary.wiley.com/doi/10.1111/17…
Code: github.com/dkobak/covid-u…

The figure below highlights 🇷🇺 and 🇺🇦. /1

What is "underdispersion"? Here is an example. Russia reported the following numbers of Covid deaths during the first week of September 2021: 792, 795, 790, 798, 799, 796, 793.

Mean: 795. Variance: 11. For Poisson random data, mean=variance. So this is *underdispersed*. /2
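The check is easy to reproduce:

```python
# Quick check of the example's dispersion (Poisson data should have variance ≈ mean).
import numpy as np

deaths = np.array([792, 795, 790, 798, 799, 796, 793])  # Russia, first week of Sept 2021
mean, var = deaths.mean(), deaths.var(ddof=1)
print(f"mean = {mean:.1f}, variance = {var:.1f}, variance/mean = {var / mean:.3f}")
# variance/mean << 1  =>  strongly underdispersed relative to a Poisson process
```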
Sep 30, 2021
So what's up with the Russian election two weeks ago? Was there fraud?

Of course there was fraud. Widespread ballot stuffing was videotaped etc., but we can also prove fraud using statistics.

See these *integer peaks* in the histograms of the polling station results? 🕵️‍♂️ [1/n]

These peaks are formed by polling stations that report an integer turnout percentage or United Russia percentage. E.g. 1492 ballots cast at a station with 1755 registered voters: 1492/1755 = 85.0%. Important: 1492 is not a suspicious number! It's the 85.0% that is suspicious. [2/n]
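A toy simulation of the effect (not the paper's analysis): stations that target a round turnout percentage produce an excess of (near-)integer values in the histogram.

```python
# Toy simulation: how targeting round turnout percentages creates integer peaks.
import numpy as np

def integer_share(ballots, voters, tol=0.05):
    """Fraction of stations whose turnout % lands within tol of a whole number."""
    pct = 100.0 * ballots / voters
    return np.mean(np.abs(pct - np.round(pct)) < tol)

rng = np.random.default_rng(1)
voters = rng.integers(500, 3000, size=10_000)
honest = rng.binomial(voters, 0.6)               # turnout with no targeting of round numbers
stuffed = np.round(voters * 0.85).astype(int)    # stations "hitting" exactly 85.0%

print(integer_share(honest, voters))                              # baseline share
print(integer_share(np.r_[honest, stuffed], np.r_[voters, voters]))  # inflated share
```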
Sep 23, 2021
Chari et al. (@lpachter) have updated their preprint and doubled down on their claim that an 🐘-looking embedding, a random (!) embedding, and 2D PCA, all preserve data structure "similar or better" than t-SNE.

I still think this claim is absurd. [1/n]

They literally say: "Picasso can quantitatively represent [local and global properties] similarly to, or better, than the respective t-SNE/UMAP embeddings".

In my thread below I argued it's a non-sequitur from Fig 2, due to insufficient metrics. [2/n]
Sep 13, 2021
I am late to the party (was on holidays), but have now read @lpachter's "Specious Art" paper as well as ~300 quote tweets/threads, played with the code, and can add my two cents.

Spoiler: I disagree with their conclusions. Some claims re t-SNE/UMAP are misleading. Thread. 🐘

The paper has several parts and I have too many comments for a Twitter thread, so here I will only focus on the core of the authors' argument against t-SNE/UMAP, namely Figures 2 and 3. We can discuss the rest some other time. [2/n]
Jan 12, 2021
OK, I'll bite.

PHATE (nature.com/articles/s4158…) from @KrishnaswamyLab is like Isomap meeting Diffusion Maps: MDS on geo distances obtained via diffusion. Cool paper!

So let's test it on: (1) MNIST, (2) Tasic2018, (3) n=1.3 million from 10x. Does it work as well as promised? 🧐 [1/7]

Here is MNIST.

PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that are also emphasized by UMAP, but fails to separate some of the digits within mega-clusters, e.g. green & red (3 and 5) overlap a lot.

IMHO that's a clearly worse performance than t-SNE or UMAP. [2/7]
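A rough sketch of such a comparison, assuming the phate and openTSNE packages with default parameters (not tuned):

```python
# Sketch of the MNIST comparison, assuming the `phate` and `openTSNE` packages.
import matplotlib.pyplot as plt
import phate
from openTSNE import TSNE
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)
X = mnist.data / 255.0
labels = mnist.target.astype(int)

Y_phate = phate.PHATE(n_jobs=-1).fit_transform(X)   # PHATE embedding
Y_tsne = TSNE(n_jobs=-1).fit(X)                      # t-SNE embedding for comparison

# Color by digit to compare how well digits within the 4/7/9 and 8/5/3
# mega-clusters get separated.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, Y, title in zip(axes, [Y_phate, Y_tsne], ["PHATE", "t-SNE"]):
    ax.scatter(Y[:, 0], Y[:, 1], c=labels, s=1, cmap="tab10")
    ax.set_title(title)
plt.show()
```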
Dec 10, 2020
In a new paper with @JanLause & @CellTypist we argue that the best approach for normalization of UMI counts is *analytic Pearson residuals*, using an NB model with an offset term for sequencing depth. + We analyze related 2019 papers by @satijalab and @rafalab. /1

biorxiv.org/content/10.110…

Our project began when we looked at Fig 2 in Hafemeister & Satija 2019 (genomebiology.biomedcentral.com/articles/10.11…), who suggested using NB regression (w/ smoothed params), and wondered:

1) Why does smoothed β_0 grow linearly?
2) Why is smoothed β_1 ≈ 2.3??
3) Why does smoothed θ grow too??? /2
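A minimal sketch of analytic Pearson residuals under these assumptions (NB model with a depth offset, θ = 100, clipping at ±√n_cells); see the preprint for the full details.

```python
# Minimal sketch of analytic Pearson residuals for a cells x genes UMI count matrix
# (NB model with overdispersion theta and a sequencing-depth offset).
# theta = 100 and sqrt(n_cells) clipping are the assumed choices here.
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """counts: (n_cells, n_genes) raw UMI counts. Returns the residual matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
    z = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)   # NB Pearson residuals
    clip = np.sqrt(counts.shape[0])                     # clip to +- sqrt(n_cells)
    return np.clip(z, -clip, clip)

rng = np.random.default_rng(0)
toy = rng.poisson(1.0, size=(200, 50))                  # toy count matrix
print(pearson_residuals(toy).shape)
```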
Oct 21, 2020
Remember the galaxy-like UMAP visualization of integers from 1 to 1,000,000 represented as prime factors, made by @jhnhw?

I did t-SNE of the same data, and figured out what the individual blobs are. Turns out, the swirly and spaghetti UMAP structures were artifacts :-(

[1/n]

Here is the original tweet by @jhnhw. His write-up: johnhw.github.io/umap_primes/in…. UMAP preprint v2 by @leland_mcinnes et al. has a figure with 30,000,000 (!) integers.

But what are all the swirls and spaghetti?

Unexplained mystery since 2008. CC @ch402. [2/n]
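For context, a sketch of how this dataset is built: each integer becomes a sparse binary vector over its prime divisors, which is then embedded in 2D. The metric and the smaller N here are illustrative assumptions, not the original settings.

```python
# Sketch of the underlying data: each integer 2..N becomes a sparse binary vector
# over primes (1 if the prime divides it); those vectors are then embedded in 2D.
import numpy as np
from scipy.sparse import lil_matrix
from sympy import primefactors, primerange
import umap

N = 100_000                                       # the original visualization used 1,000,000
primes = {p: i for i, p in enumerate(primerange(2, N + 1))}

X = lil_matrix((N - 1, len(primes)), dtype=np.float32)
for n in range(2, N + 1):
    for p in primefactors(n):                     # distinct prime divisors of n
        X[n - 2, primes[p]] = 1.0
X = X.tocsr()

Y = umap.UMAP(metric="cosine").fit_transform(X)   # 2D embedding; t-SNE works similarly
```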
Jul 20, 2020
New preprint on attraction-repulsion spectrum in t-SNE => continuity-discreteness trade-off!

We also show that UMAP has higher attraction due to negative sampling, and not due to its loss. 🤯 Plus we demystify FA2.

With @jnboehm and @CellTypist.
arxiv.org/abs/2007.08902 [1/n]

We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ→∞ corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]
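With openTSNE, the spectrum can be traversed simply by varying the exaggeration factor; a minimal sketch (the dataset here is a random placeholder):

```python
# Sketch: traversing the attraction-repulsion spectrum by varying the exaggeration
# factor rho in t-SNE (via openTSNE's `exaggeration` parameter).
import numpy as np
from openTSNE import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 50))          # placeholder for a real dataset

embeddings = {}
for rho in [1, 4, 12, 30]:                   # rho = 1 is standard t-SNE; larger = more attraction
    embeddings[rho] = TSNE(exaggeration=rho, n_jobs=-1, random_state=42).fit(X)
```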
Mar 26, 2020
Spent some time investigating the history of "double descent". As a function of model complexity, I haven't seen it described before 2017. As a function of sample size, it can be traced to 1995; earlier research seems less relevant. Also: I think we need a better term. Thread. (1/n)

The term "double descent" was coined by Belkin et al 2019 pnas.org/content/116/32… but the same phenomenon was also described in two earlier preprints: Spigler et al 2019 iopscience.iop.org/article/10.108… and Advani & Saxe 2017 arxiv.org/abs/1710.03667 (still unpublished?) (2/n)
Feb 12, 2020
Becht et al.: UMAP preserves global structure better than t-SNE.

@GCLinderman & me: only because you used random init for t-SNE but spectral init for UMAP.

@NikolayOskolkov: that's wrong; init does not matter; the loss function does.

This thread is a response to Nikolay. (1/n)

@NikolayOskolkov is the only person I saw arguing with that. Several people provided further simulations showing that UMAP with random init can mess up the global structure. I saw @leland_mcinnes agreeing that init can be important. It makes sense. (2/n)
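A minimal sketch of the comparison: same data and settings, only the initialization differs (spectral, UMAP's default, vs random).

```python
# Sketch: same data, same UMAP settings, only the initialization differs.
import numpy as np
import umap

rng = np.random.default_rng(0)
# Toy data with global structure: three well-separated Gaussian clusters in 50D
X = np.vstack([rng.standard_normal((500, 50)) + c for c in (0, 30, 60)])

Y_spectral = umap.UMAP(init="spectral", random_state=42).fit_transform(X)
Y_random = umap.UMAP(init="random", random_state=42).fit_transform(X)
# With random init, the relative placement of the three clusters is much less
# reliable across runs than with spectral init.
```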
Dec 20, 2019
A year ago in Nature Biotechnology, Becht et al. argued that UMAP preserved global structure better than t-SNE. Now @GCLinderman and I have written a comment saying that their results were entirely due to the different initialization choices: biorxiv.org/content/10.110…. Thread. (1/n)

Here is the original paper: nature.com/articles/nbt.4… by @EtienneBecht @leland_mcinnes @EvNewell1 et al. They used three data sets and two quantitative evaluation metrics: (1) preservation of pairwise distances and (2) reproducibility across repeated runs. UMAP won 6/6. (2/10)
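As a sketch of metric (1) (their exact implementation may differ), one can correlate pairwise distances before and after embedding on a random subsample:

```python
# Sketch of a distance-preservation metric: correlation between pairwise
# distances in the original space X and in the 2D embedding Y, on a subsample.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def distance_preservation(X, Y, n_sample=1000, seed=0):
    """Correlation between high-dim and embedding pairwise distances."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_sample, len(X)), replace=False)
    return pearsonr(pdist(X[idx]), pdist(Y[idx]))[0]
```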
Dec 16, 2019
"The art of using t-SNE for single-cell transcriptomics" by @CellTypist and myself was published two weeks ago: nature.com/articles/s4146…. This is a thread about the initialisation, the learning rate, and the exaggeration in t-SNE. I'll use MNIST to illustrate. (1/16) FIRST, the initialisation. Most implementations of t-SNE use random initialisation: points are initially placed randomly and gradient descent then makes similar points attract each other and collect into clusters. We argue that random initialisation is often a bad idea (2/16).