Another core similarity between NLP & single cell biology is the large and ever-growing amount of publicly available #scRNAseq data (e.g. the Human Cell Atlas) to be used for training.
Can NLP models also understand the intrinsic logic of single cell biology & develop "emergent thinking"?
These are compelling parallels, and deep learning tools leveraging them do exist.
But AFAIK none of these tools is as generative & versatile as scGPT.
‼️However: while words in a sentence are sequential, genes in a cell are not.
This is tricky, as GPT predicts the next token auto-regressively, given the sequence of previous tokens.
To understand how this issue is solved here, we'll need to dive into the details of the model.
But before discussing the methodological details, let's see how scGPT is built and how it performs on single cell analyses (hint: it does very well).
Each application corresponds to a different fine-tuning routine, with its own specific objective function.
During training on 10.3 million scRNAseq blood & bone marrow cells from CellXGene, scGPT simultaneously learns cell & gene representations.
The model gradually learns to generate gene expression of cells based on the condition & expression of existing cells. cellxgene.cziscience.com
The pre-trained model can be fine-tuned to new datasets & specific tasks.
The authors offer fine-tuning pipelines for several tasks commonly done in single cell analysis, such as integration, batch correction & more.
1. Integration of multiple scRNA-seq datasets with batch correction
scGPT was benchmarked against scVI (also a deep learning model), Harmony and Seurat, on integrating two datasets: PBMC (2 batches) & Immune Human (10 batches).
scGPT performed best, as assessed by multiple biological conservation metrics (remember, the goal here is to minimize the spread of cells of the same cell type).
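To make "biological conservation" concrete, here's a minimal sketch (not the paper's benchmark code; toy data, scikit-learn only) of one commonly used metric of this kind: the average silhouette width on cell type labels, which is high when cells of the same type stay compact after integration.

```python
# Toy sketch of a cell-type silhouette score; embeddings & labels are random stand-ins.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 32))      # integrated cell embeddings
cell_types = rng.integers(0, 4, size=300)    # ground-truth cell type labels

# Silhouette lies in [-1, 1]; benchmarks often rescale it to [0, 1].
asw = silhouette_score(embeddings, cell_types)
print(f"cell-type ASW (scaled): {(asw + 1) / 2:.3f}")
```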
scGPT (dark pink) is consistently a higher bar than the rest.
Still, all tools seem to do pretty well generally❗
2. Cell type annotation
For this task, the pre-trained scGPT model was fine-tuned using cross-entropy loss against ground-truth labels from a new reference dataset of human pancreas cells.
It was then tasked to identify cell types on another human pancreas dataset.
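Here's a minimal sketch of that fine-tuning recipe, assuming a generic pre-trained encoder that maps a cell's input to a fixed-size embedding (all names & shapes are hypothetical, not the authors' code):

```python
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    """Pre-trained encoder + linear head, trained with cross-entropy against labels."""
    def __init__(self, encoder: nn.Module, emb_dim: int, n_cell_types: int):
        super().__init__()
        self.encoder = encoder                        # stands in for the pre-trained scGPT encoder
        self.head = nn.Linear(emb_dim, n_cell_types)  # new classification head

    def forward(self, cell_input):
        cell_emb = self.encoder(cell_input)           # (batch, emb_dim) cell representation
        return self.head(cell_emb)                    # (batch, n_cell_types) logits

# Dummy encoder just to make the sketch run end to end.
model = CellTypeClassifier(nn.Linear(100, 64), emb_dim=64, n_cell_types=5)
logits = model(torch.randn(8, 100))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))
loss.backward()
```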
scGPT did very well on this task, classifying cells almost perfectly.
However, the cell type labels to be inferred are pretty general & biologically distinct, so separating them is quite easy, regardless of how strong the model is.
It's interesting to pause & reflect for a bit here.
We've just seen that ChatGPT can also do cell type assignment, essentially by literature browsing
However, there's a very important difference here
scGPT is, in a sense, the opposite of literature browsing, as it is fully automatic
Genes are the smallest unit of information, equivalent to a word in natural language generation.
The gene names are used as tokens.
Each gene is assigned a unique integer identifier id(gj) within the complete vocabulary of tokens.
The input gene tokens of each cell i form a vector tg(i) of pre-defined length M, where M is the number of highly variable genes (which can vary between applications).
This conceptual parallel is nice & easy to work with, as different sets of gene tokens can be integrated into a common vocabulary by taking the union of all genes across all the studies analyzed in an application (with, however, potential computing restrictions on M during pre-training).
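A tiny sketch of that vocabulary-union idea (illustrative gene lists, plain Python):

```python
# Merge gene names from several studies into one shared token vocabulary.
study_a = ["CD3D", "MS4A1", "NKG7"]
study_b = ["CD3D", "LYZ", "PPBP"]

vocab = {gene: idx for idx, gene in enumerate(sorted(set(study_a) | set(study_b)))}
tokens_a = [vocab[g] for g in study_a]   # gene names -> unique integer ids
print(vocab)   # {'CD3D': 0, 'LYZ': 1, 'MS4A1': 2, 'NKG7': 3, 'PPBP': 4}
```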
Next, the authors choose to normalize their input data (raw counts) in a different manner than the usual normalization routines (e.g. TPM or log1p).
The argument is that absolute values can convey different "semantic" meanings in different scenarios (for example, across multiple batches).
This approach is both interesting & quite extreme: the raw counts for each cell are separated into a number B of bins according to their expression values.
All genes within a bin get assigned the same value, namely the index of that bin (e.g. all genes in bin number 10 are assigned the value 10).
Before binning, the log1p transformation is applied, followed by selecting the M most highly variable genes.
Therefore: the input for cell i consists of the log1p expression values of the M most highly variable genes, binned into B bins (I'm not sure about the actual value of B).
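Here's a small sketch of the binning step as I read it (the value of B & the exact quantile scheme are my assumptions, not the authors' code):

```python
import numpy as np

def bin_expression(counts: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Map one cell's raw counts to integer bin indices in [0, n_bins - 1]."""
    logged = np.log1p(counts)
    # Quantile-based bin edges over the cell's non-zero values (my assumption).
    edges = np.quantile(logged[logged > 0], np.linspace(0, 1, n_bins + 1))
    # All genes falling in the same bin share its index as their input value.
    return np.digitize(logged, edges[1:-1])

cell_counts = np.array([0, 1, 3, 10, 50, 200])
print(bin_expression(cell_counts))   # [0 0 1 2 3 4]
```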
Here's what this means: binning trades noise reduction against a loss of biological variability.
I am curious how impactful this step is on the model's performance & how other normalizations would do.
Tuning the normalization might be a way to improve the model.
Another interesting strategy is that, during pre-training, the input is restricted to only genes with non-zero expression for each input cell.
In contrast, during fine-tuning, all genes (both zero and non-zero) are included.
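In code, the pre-training restriction could look as simple as this (toy NumPy sketch, not the authors' implementation):

```python
import numpy as np

gene_ids = np.array([0, 1, 2, 3, 4])
counts = np.array([0, 7, 0, 2, 1])

# Pre-training: keep only the genes this cell actually expresses.
nonzero = counts > 0
pretrain_tokens, pretrain_values = gene_ids[nonzero], counts[nonzero]
print(pretrain_tokens, pretrain_values)   # [1 3 4] [7 2 1]
```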
The model also accommodates gene condition tokens, which are indicators of class belonging among the genes, such as belonging to functional pathways or perturbation alterations.
Each cell is considered a "sentence" composed of genes, and its representation is obtained by aggregating the gene-level representations.
As in NLP transformers, there is a special token that indicates cell membership, i.e. which genes belong together in a cell.
The model allows adding condition tokens among cells as well, for indicating different sequencing modalities, batches, perturbation states & others.
Each such condition is modeled as a gene-level token, repeated M times for each cell i.
The cell-level tokens are not used as input to the transformer blocks; rather, they are concatenated with the transformer output before fine-tuning (e.g. concatenating the cell representation with the batch embedding in the scRNAseq integration task).
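Putting the token machinery together, here's a minimal sketch of how the pieces could fit (dimensions, names & the mean-pooled cell representation are my assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

n_genes_vocab, n_bins, n_conditions, n_batches, d = 1000, 50, 4, 10, 64

gene_emb = nn.Embedding(n_genes_vocab, d)    # gene-name tokens
value_emb = nn.Embedding(n_bins, d)          # binned expression values
cond_emb = nn.Embedding(n_conditions, d)     # gene-level condition tokens
batch_emb = nn.Embedding(n_batches, d)       # cell-level (batch) token

tokens = torch.randint(0, n_genes_vocab, (1, 8))   # one cell, M = 8 genes
values = torch.randint(0, n_bins, (1, 8))
conds = torch.randint(0, n_conditions, (1, 8))

# Per-gene input embeddings fed to the transformer blocks.
x = gene_emb(tokens) + value_emb(values) + cond_emb(conds)
cell_repr = x.mean(dim=1)   # stand-in for the special-token cell representation

# Cell-level token: concatenated AFTER the transformer, not fed into it.
fused = torch.cat([cell_repr, batch_emb(torch.tensor([3]))], dim=-1)
print(fused.shape)   # torch.Size([1, 128]) -> input to fine-tuning heads
```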
Finally, what about the sequentiality among genes, as in text generation?
The answer here is a specialized attention masking procedure that defines the *order of prediction* from attention scores.
This is an innovative idea, which also helps capture interactions among genes.
It works by iteratively predicting the expression of a new set of genes, which in turn becomes part of the "known genes" in the next iteration of attention computation.
Across iterative generation rounds, this creates an order among genes based on prediction confidence, mimicking auto-regressive generation.
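Here's a toy sketch of that confidence-ordered, iterative generation loop (my paraphrase with a dummy scorer, not the authors' implementation):

```python
import torch

def iterative_generate(score_fn, n_genes: int, per_round: int = 2):
    known = torch.zeros(n_genes, dtype=torch.bool)   # genes already generated
    values = torch.zeros(n_genes)
    while not known.all():
        preds, confidence = score_fn(values, known)  # score all genes given the known set
        confidence[known] = -float("inf")            # never re-pick known genes
        # Fix the most confident predictions; they become "known" next round.
        pick = confidence.topk(min(per_round, int((~known).sum()))).indices
        values[pick], known[pick] = preds[pick], True
    return values

# Dummy scorer (random predictions & confidences), just to run the loop end to end.
def dummy_score_fn(values, known):
    return torch.rand(values.shape[0]), torch.rand(values.shape[0])

print(iterative_generate(dummy_score_fn, n_genes=6))
```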
That's what scGPT is all about!
Now, clear future directions:
1. Multiple normal cell types (not only immune) - Likely easy
2. Disease - Likely very difficult, but also very interesting (think cancer heterogeneity)
3. Spatial location - can it also be learned & generated?
The single cell genomics community has many reasons to be hyped. Not only about this pre-print itself, but also about what it represents, and especially about the potential of this type of model.
There are many things in biology that we have no idea how to understand or predict.
LLMs are a new shot at this complex problem.
Intuitively, it’s a very exciting direction to explore.
It’s impressive that scGPT performs so well on many different tasks. Still, its performance is comparable to (indeed slightly better than!) existing tools.
So it’s natural to ask ourselves: is there any real gain? Are we willing to trade interpretability for modest performance gains?
In fact, the biggest advantages of this framework are its modularity & scalability.
On the basis of a single pre-trained model, multiple fine-tuning routines, corresponding to different single cell applications, can be built relatively easily.
This is nicely extendable
There are no words for how excited I am about the potential of applying this framework to diseased tissues & understanding the complex language of cancer genomics.
This preprint was a great read; looking forward to trying out the fine-tuned models!
The paper looked at 4 already-annotated public datasets: Azimuth, Human Cell Atlas, Human Cell Landscape, Mouse Cell Atlas.
Differentially Expressed Genes (DEGs) characterizing every cluster in these studies were generally available with the publications & were also downloaded.
The paper I am sharing today is a thoughtful philosophical perspective from @sdomcke & @JShendure proposing a new organizational framework for single cell data, as an alternative to e.g. the Human Cell Atlas.
Compelling read for both lovers❤️ & skeptics🤔 of single cell genomics
🧵🧵
This thread is organized as follows:
1️⃣ The need to organize Biology
2️⃣ How to organize cell types?
3️⃣ A consensus ontology
4️⃣ Structure & representation of the cell reference tree
5️⃣ Resolution of tree labels
6️⃣ Example tree
7️⃣ Human tree
8️⃣ Thoughts
1️⃣ The need to organize Biology
The paper starts with a very thoughtful phrasing, namely that Biology engages in "summarizing" the natural world.
Once we accept this, it becomes clear that we need to give consideration to how biological entities are "classified".
We use single cell protein quantification & single cell FISH to map #spatial interactions in genetic mosaicism & the tumor microenvironment in #Glioblastoma!
I need to raise awareness about an important point in #scRNAseq data analysis, which, in my opinion, is not acknowledged enough:
‼️In practice, most cell type assignment methods will fail on totally novel cell types. Biological/expert curation is necessary!
Here's one example👇
Last year, together with @LabPolyak@harvardmed, we published a study in which we did something totally awesome: we experimentally showed how a TGFBR1 inhibitor drug 💊 prevents breast tumor initiation in two different rat models!