In a new paper with @JanLause & @CellTypist we argue that the best approach to normalization of UMI counts is *analytic Pearson residuals*, using an NB model with an offset term for sequencing depth. We also analyze related 2019 papers by @satijalab and @rafalab. /1
Our project began when we looked at Fig 2 in Hafemeister & Satija 2019 (genomebiology.biomedcentral.com/articles/10.11…), who suggested using NB regression (with smoothed parameters), and wondered:
1) Why does smoothed β_0 grow linearly? 2) Why is smoothed β_1 ≈ 2.3?? 3) Why does smoothed θ grow too??? /2
The original paper does not answer any of that.
Jan figured out that: (1) is trivially true when assuming UMI ~ NB(p_gene * n_cell); (2) simply follows from the HS2019 parametrization, and the magic constant is 2.3 = ln(10); (3) is due to bias in the estimation of the overdispersion parameter θ! /3
We argue, based on theory, that NB regression should have sequencing depth (the sum of UMI counts per cell) as an _offset_ term and should have only one free parameter, the intercept β_0. In that case it has a simple analytic solution!
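For concreteness, here is the offset model and its closed-form fit (a sketch; the notation here is mine: X_cg is the count of gene g in cell c, and n_c = Σ_g X_cg is the sequencing depth):

$$
\ln \mu_{cg} = \beta_{0g} + \ln n_c
\quad\Rightarrow\quad
\hat\beta_{0g} = \ln\frac{\sum_c X_{cg}}{\sum_c n_c},
\qquad
\hat\mu_{cg} = \frac{n_c \sum_{c'} X_{c'g}}{\sum_{c'} n_{c'}},
$$

i.e. the predicted mean is the cell's depth times the gene's overall fraction of counts. This is the exact MLE in the Poisson case; with the NB variance it serves as a plug-in estimate.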
Furthermore, we analyzed several negative control datasets (as @vallens did in Svensson 2020 to show that UMI data are not zero-inflated: nature.com/articles/s4158…) and argue that technical overdispersion relative to Poisson is quite low, with technical θ ~ 100 (for all genes). /6
Putting this all together, we get a simple formula for Pearson residuals in closed form. /7
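For anyone who wants to try it: a minimal NumPy sketch of the closed-form residuals (function and variable names are mine, not from any package; θ = 100 and the clipping at ±√(number of cells) follow the recommendations above):

```python
import numpy as np

def analytic_pearson_residuals(X, theta=100.0):
    """Analytic Pearson residuals for a cells x genes UMI count matrix X."""
    X = np.asarray(X, dtype=float)
    n_c = X.sum(axis=1, keepdims=True)            # sequencing depth per cell
    p_g = X.sum(axis=0, keepdims=True) / X.sum()  # each gene's fraction of all counts
    mu = n_c * p_g                                # expected counts under the offset model
    z = (X - mu) / np.sqrt(mu + mu**2 / theta)    # NB variance with fixed overdispersion
    clip = np.sqrt(X.shape[0])                    # clip residuals to +-sqrt(n_cells)
    return np.clip(z, -clip, clip)
```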
As argued by HS2019 and THAI2019 (Townes et al.), Pearson residuals can be used for feature selection: simply select the genes with the highest variance of Pearson residuals. And for normalization: use Pearson residuals for downstream processing, e.g. t-SNE. /8
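Hypothetical usage on top of the sketch above, with X a cells × genes count matrix:

```python
residuals = analytic_pearson_residuals(X)
gene_var = residuals.var(axis=0)                  # residual variance per gene
top_genes = np.argsort(gene_var)[::-1][:2000]     # e.g. keep the 2000 most variable genes
```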
Jan analyzed a bunch of different datasets and our conclusion is that (analytic) Pearson residuals tend to work better than normalization + log-transform or sqrt-transform (@flo_compbio), and also better (and much faster!) than the full-blown GLM-PCA. /9
Christoph and Rahul @satijalab wrote a response arguing that their 'regularized' approach yields a more meaningful gene selection than our 'analytic' approach. Brief reply in the next tweet. But importantly, we all agree that Pearson resid. are cool! /11
"Meaningful" PBMC genes listed by H&S as having higher variance in scTransform also have high analytic var (black dots) and would be selected. "Non-meaningful" genes all have much lower var (red dots). The difference is due to θ=100 vs θ≈10, not due to offset/regularization. /12
The input here is a 1,000,000 × 78,628 matrix X with X_ij = 1 if integer i is divisible by the j-th prime number, and 0 otherwise. So columns correspond to 2, 3, 5, 7, 11, etc. The matrix is large but very sparse: only 0.0036% of entries are 1s. We'll use cosine similarity. [3/n]
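A sketch of how such a matrix can be built with SciPy (sizes here are illustrative; filling it column by column is cheap since there are only ~3M nonzeros):

```python
import numpy as np
from scipy import sparse

n = 1_000_000
# sieve of Eratosthenes to list the primes up to n
is_prime = np.ones(n + 1, dtype=bool); is_prime[:2] = False
for p in range(2, int(n**0.5) + 1):
    if is_prime[p]:
        is_prime[p*p::p] = False
primes = np.flatnonzero(is_prime)

rows, cols = [], []
for j, p in enumerate(primes):
    multiples = np.arange(p, n + 1, p)        # integers i <= n divisible by p
    rows.append(multiples - 1)                # 0-based row index of integer i
    cols.append(np.full(multiples.size, j))
rows, cols = np.concatenate(rows), np.concatenate(cols)
X = sparse.csr_matrix((np.ones(rows.size), (rows, cols)), shape=(n, primes.size))
```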
We get the spectrum by changing the "exaggeration" in t-SNE, i.e. multiplying all attractive forces by a constant factor ρ. Prior work by @GCLinderman et al. showed that ρ → ∞ corresponds to Laplacian eigenmaps. We argue that the entire spectrum is interesting. [2/n]
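In practice this knob can be swept with openTSNE, which exposes an `exaggeration` parameter (a sketch; `data` is a placeholder for any n × d matrix and the ρ values are illustrative):

```python
import numpy as np
from openTSNE import TSNE

data = np.random.rand(1000, 50)  # placeholder for the real data

# rho = 1 is standard t-SNE; increasing rho moves towards Laplacian eigenmaps
embeddings = {rho: TSNE(exaggeration=rho, initialization="pca", random_state=42).fit(data)
              for rho in [1, 4, 30]}
```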
Here is a toy dataset with 20 Gaussians arranged on a line, like a necklace. With LE one sees the string. With t-SNE one sees the individual beads. [3/n]
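A minimal sketch to generate such a necklace (dimensionality and spacing are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_per, d = 20, 100, 10
centers = np.zeros((k, d))
centers[:, 0] = 6 * np.arange(k)                        # bead centers along a line
X = np.vstack([c + rng.normal(size=(n_per, d)) for c in centers])
labels = np.repeat(np.arange(k), n_per)                 # bead membership of each point
```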
Spent some time investigating the history of "double descent". As a function of model complexity, I haven't seen it described before 2017. As a function of sample size, it can be traced back to 1995; earlier research seems less relevant. Also: I think we need a better term. Thread. (1/n)
I don't like the term "double descent" because it has nothing to do with gradient descent. And nothing is really descending. It's all about bias-variance tradeoffs, so maybe instead of the U-shaped tradeoff one should talk about \/\-shaped? И-shaped? UL-shaped? ʯ-shaped? (3/n)
@NikolayOskolkov is the only person I saw arguing against that. Several people provided further simulations showing that UMAP with random init can mess up the global structure. I saw @leland_mcinnes agree that init can be important. It makes sense. (2/n)
A year ago in Nature Biotechnology, Becht et al. argued that UMAP preserves global structure better than t-SNE. Now @GCLinderman and I wrote a comment saying that their results were entirely due to the different initialization choices: biorxiv.org/content/10.110…. Thread. (1/n)
Here is the original paper: nature.com/articles/nbt.4… by @EtienneBecht, @leland_mcinnes, @EvNewell1 et al. They used three datasets and two quantitative evaluation metrics: (1) preservation of pairwise distances and (2) reproducibility across repeated runs. UMAP won 6/6. (2/10)
UMAP and t-SNE optimize different loss functions, but the implementations used in Becht et al. also had different default initializations: t-SNE was initialized randomly, whereas UMAP was initialized with the Laplacian eigenmaps (LE) embedding of the kNN graph. (3/10)
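One way to put both methods on the same footing is to fix the initialization explicitly (a sketch with openTSNE and umap-learn; `X` is a placeholder for the real data):

```python
import numpy as np
from openTSNE import TSNE
from umap import UMAP

X = np.random.rand(1000, 50)  # placeholder for the real data

tsne_pca = TSNE(initialization="pca", random_state=0).fit(X)            # informative init
umap_spectral = UMAP(init="spectral", random_state=0).fit_transform(X)  # LE-style default
umap_random = UMAP(init="random", random_state=0).fit_transform(X)      # random init
```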
"The art of using t-SNE for single-cell transcriptomics" by @CellTypist and myself was published two weeks ago: nature.com/articles/s4146…. This is a thread about the initialisation, the learning rate, and the exaggeration in t-SNE. I'll use MNIST to illustrate. (1/16)
FIRST, the initialisation. Most implementations of t-SNE use random initialisation: points are initially placed randomly and gradient descent then makes similar points attract each other and collect into clusters. We argue that random initialisation is often a bad idea (2/16).
The t-SNE loss function only cares about preserving local neighbourhoods. With random initialisation, the global structure is usually not preserved, meaning that the arrangement of isolated clusters is largely arbitrary and depends mostly on the random seed. (3/16)
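A small sketch of that seed dependence with scikit-learn's t-SNE (`X` here is a placeholder for data like flattened MNIST digits):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 50)  # placeholder for the real data

emb_a = TSNE(init="random", random_state=0).fit_transform(X)
emb_b = TSNE(init="random", random_state=1).fit_transform(X)  # different global layout
emb_pca = TSNE(init="pca", random_state=0).fit_transform(X)   # reproducible layout
```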