Lior Pachter

May 11 • 25 tweets • 13 min read

The analysis of single-cell RNA-seq data begins with "normalizing" counts. In a preprint with @sinabooeshaghi, @IngileifBryndis & @agalvezmerchan, we examine the assumptions and challenges of normalization, benchmark methods, and motivate solutions: biorxiv.org/content/10.110… 🧵 1/

We weren't particularly interested in studying normalization, but faced a vexing problem related to normalizing feature barcodes. In scouring the literature for solutions to our problem, we became increasingly confused rather than enlightened about how to normalize our data. 2/

We started with the excellent recent review / expository article by @const_ae & @wolfgangkhuber that looks at strengths & weaknesses of many methods: biorxiv.org/content/10.110…. It became clear to us that a central question is how to normalize depth w/ gene count overdispersion. 3/

An important publication focusing on this question is the sctransform paper written by Hafemeister & Satija (from the Seurat team) in 2019: genomebiology.biomedcentral.com/articles/10.11…. They proposed using Pearson residuals derived from regularized negative binomial regression. 4/

The term "sctransform" refers both to a statistical method, as described in the Hafemeister & Satija paper, and to software implemented as part of Seurat. In trying to understand exactly what the two are, and how they relate to each other, we descended into a squirrel hole... 5/

Starting with the software, it became a problem that Seurat is not published, and there is no explanation anywhere of exactly how normalization is implemented in the program or how it's used in the various Seurat functions. So we made a map for ourselves by reading the code. 6/

A lot of the normalization literature focuses on statistical aspects of various transformations, but the reality of Seurat is that results are much more influenced by software engineering decisions; different normalizations interact with each other in non-trivial ways. 7/

By the way, it's a similar story in Scanpy, but a full comparative analysis & explanation is a matter for another 🧵. I will say, as have others, that it's a serious problem for the field with implications for many frequently performed analyses. 8/

https://twitter.com/davisjmcc/status/1524178348863520770?s=20&t=KlkFnIEQ320hQbXnw5c9ew

The engineering matters, but that is not to say statistical details are unimportant. As @JanLause, @CellTypist and @hippopedoid have shown, the sctransform gene-specific overdispersion estimates are not only superfluous but possibly detrimental. genomebiology.biomedcentral.com/articles/10.11… 9/
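To make the alternative concrete, here is a minimal sketch of Pearson residuals computed with one overdispersion parameter shared by all genes, rather than per-gene estimates. It is not the sctransform implementation and not code from the preprint. The function name, the default `theta=100`, and the clipping at the square root of the number of cells are assumptions chosen for illustration.

```python
import numpy as np

def pearson_residuals(X, theta=100.0):
    """Pearson residuals under a negative binomial model with a single
    shared overdispersion `theta` (an assumed value, not an estimate).

    X is a cells x genes matrix of raw counts.
    """
    X = np.asarray(X, dtype=float)
    n_cells = X.shape[0]
    cell_depth = X.sum(axis=1, keepdims=True)   # total counts per cell
    gene_total = X.sum(axis=0, keepdims=True)   # total counts per gene
    with np.errstate(divide="ignore", invalid="ignore"):
        # Expected count if a cell's depth were spread across genes
        # in proportion to each gene's share of all counts.
        mu = cell_depth * gene_total / X.sum()
        var = mu + mu**2 / theta                # negative binomial variance
        r = np.where(var > 0, (X - mu) / np.sqrt(var), 0.0)
    # Clip extreme residuals (assumed bound: sqrt of the number of cells).
    bound = np.sqrt(n_cells)
    return np.clip(r, -bound, bound)

counts = np.array([[10, 0, 5],
                   [100, 20, 80],
                   [1, 1, 2]])
residuals = pearson_residuals(counts)
```

A count matrix in which every cell has exactly the same gene proportions gives residuals of zero everywhere, which is a quick sanity check on the expected-count term.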

But the reality of #scRNAseq normalization right now is that in packages such as Scanpy and Seurat, choosing a "method" means choosing a collection of methods, not a single one. We asked whether there can be a single transformation that could serve multiple tasks. Tl;dr yes! 10/

To clarify what properties a transformation should have, we surveyed the needs of several common tasks (e.g., dim. reduction, differential expression, marker gene finding etc.) BTW we're not the first to do this, see e.g. nature.com/articles/nmeth… by @CataVallejosM et al. 11/

But we took a nuanced view on some tasks. E.g., we distinguish differential expression from finding markers, two tasks which are unfortunately frequently confounded. A good marker is not only higher (in expression) in a cell type relative to others, it is also specific. 12/

Consider, e.g., the heatmap in @LBCastle et al. from the @NIHDirector (Francis Collins') lab. Markers require diff. abundance not just across cell types, but also genes within cell types (rows & columns). This calls for monotonic normalization, otherwise cols. are scrambled. 13/

We concluded that three properties of normalization are key: variance stabilization, depth normalization, and monotonicity. Details of how these are relevant for different analysis tasks are in our preprint. We benchmarked several popular methods with respect to these. 14/
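Two of these properties are easy to check numerically on any transformed matrix. The sketch below is illustrative only and is not the benchmark code from the preprint; the helper names `depth_spread` and `is_monotonic_within_cells` are hypothetical. It does not test variance stabilization, which needs a mean-variance comparison across genes.

```python
import numpy as np

def depth_spread(Y):
    """Coefficient of variation of per-cell totals (rows are cells).
    A value of 0 means every cell has the same total after the transform."""
    totals = np.asarray(Y, dtype=float).sum(axis=1)
    return totals.std() / totals.mean()

def is_monotonic_within_cells(X, Y, tol=1e-12):
    """True if, within every cell, a gene with a larger raw count in X
    never receives a smaller transformed value in Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    for x, y in zip(X, Y):
        order = np.argsort(x, kind="stable")   # genes sorted by raw count
        if np.any(np.diff(y[order]) < -tol):   # transformed values must not drop
            return False
    return True

counts = np.array([[10, 0, 5],
                   [100, 20, 80],
                   [1, 1, 2]], dtype=float)

proportions = counts / counts.sum(axis=1, keepdims=True)
centered = counts - counts.mean(axis=0)        # per-gene centering

print(depth_spread(counts), depth_spread(proportions))
print(is_monotonic_within_cells(counts, np.sqrt(counts)))
print(is_monotonic_within_cells(counts, centered))
```

In this toy example the square root keeps the gene ordering inside each cell, while per-gene centering reorders it, which is the "scrambled columns" problem for marker heatmaps.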

The word "popular" here is carrying a lot. There are tons of normalization methods for #scRNAseq that have been developed, including some we think are very interesting, e.g. Sanity by @jeremiebreda, @ZavolanLab and @NimwegenLab. 15/ nature.com/articles/s4158…

But we limited ourselves, for now, to the main methods implemented in Seurat, Scanpy and scprep. Figuring out how users actually use these packages was a Herculean task in and of itself. Only Scanpy from the @fabian_theis lab is published. genomebiology.biomedcentral.com/articles/10.11… 16/

Seurat, which is probably the most popular package for #scRNAseq, has different normalizations recommended in different versions, and multiple versions are in use because R users are reluctant to update R just to switch to a newer version.

https://twitter.com/lobrowR/status/1180121094189023233?s=20&t=Iyu1bz8kZOjvc-Oa2Li3xQ

17/

Now to the fun part... we benchmarked on 526 single-cell RNA-seq datasets comprising ~140 billion reads of data. The uniform processing and analysis of this data was done by two students in my group: @agalvezmerchan and @sinabooeshaghi. 18/

The figure in the previous tweet is mind-boggling to ponder. It summarizes the results for this massive analysis; individual dataset benchmarks are in our supplement which amounts to 1596 pages (!) as a result. An example from a single one of these datasets is shown below. 19/

How @agalvezmerchan and @sinabooeshaghi automated all of this is a matter for another 🧵, because it involved the development of many tools and numerous ideas. As an example, see ffq:

https://twitter.com/lpachter/status/1522322188493197312?s=20&t=ckG6lwqqlbdeg95D9t1t2w

20/

What did we learn?
1. Benchmarking on a handful of datasets, as has been the practice to date, is insufficient.
2. The statistical details of methods matter, but even suboptimal transforms (e.g. sqrt) are fine.
3. PFlog1pPF works well as an overall single transform. 21/

What is PFlog1pPF? Ideally normalization procedures should be customized to tasks, but practically it's useful to have a single normalized count matrix for many applications. PFlog1pPF stands for proportional fitting followed by log1p followed by PF; details in the preprint. 22/
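The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation; it assumes that proportional fitting rescales each cell so that its total equals the mean cell depth, and the function names are hypothetical.

```python
import numpy as np

def proportional_fitting(X, target=None):
    """Scale each cell (row) so that its total equals `target`.
    Assumption: `target` defaults to the mean total across cells."""
    X = np.asarray(X, dtype=float)
    depth = X.sum(axis=1, keepdims=True)
    if target is None:
        target = depth.mean()
    # Leave all-zero cells untouched instead of dividing by zero.
    safe_depth = np.where(depth == 0, 1.0, depth)
    return X * (target / safe_depth)

def pf_log1p_pf(X):
    """PF, then log1p, then PF again (cells x genes in, same shape out)."""
    return proportional_fitting(np.log1p(proportional_fitting(X)))

counts = np.array([[10, 0, 5],
                   [100, 20, 80],
                   [1, 1, 2]])
normalized = pf_log1p_pf(counts)
```

Every step rescales a whole cell by a positive constant or applies an increasing function, so the ordering of genes within a cell is preserved, and the final PF step gives all cells the same total.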

There's much more in the preprint. E.g., the very premise of normalization, namely that depth should/must be normalized and variance stabilized, is itself problematic, as transforms don't distinguish technical from biological "noise". We will also return to this in a future 🧵 23/

The supplement also has many results that may be of practical interest to users. E.g., it turns out that the results of sctransform in practice don't differ much from (the much faster to compute) scalelog1pCP10k. 24/
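For readers unfamiliar with the name, the sketch below reads scalelog1pCP10k as: convert to counts per 10,000, apply log1p, then scale each gene to zero mean and unit variance. That reading, the function name, and the use of the sample standard deviation are assumptions; this is not the Seurat or Scanpy code.

```python
import numpy as np

def scale_log1p_cp10k(X):
    """CP10k, then log1p, then per-gene scaling (cells x genes matrix)."""
    X = np.asarray(X, dtype=float)
    depth = X.sum(axis=1, keepdims=True)
    safe_depth = np.where(depth == 0, 1.0, depth)   # avoid dividing by zero
    cp10k = X / safe_depth * 1e4                    # counts per 10,000
    logged = np.log1p(cp10k)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0, ddof=1)                # sample standard deviation
    safe_std = np.where(std == 0, 1.0, std)         # constant genes stay at 0
    return (logged - mean) / safe_std

counts = np.array([[10, 0, 5],
                   [100, 20, 80],
                   [1, 1, 2]])
scaled = scale_log1p_cp10k(counts)
```

Because the last step centers and rescales each gene separately, the output is no longer guaranteed to preserve the ordering of genes within a cell.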

In summary, we hope that our uniformly processed datasets are useful for future work on normalization (available here github.com/pachterlab/BHG…), and that our observations on method performance / software engineering considerations are helpful for practitioners of #scRNAseq. 25/25.
