Simona Cristea
Apr 27 · 18 tweets · 5 min read
Only a matter of time before a paper formalized this exercise:

Automated #scRNAseq cell type annotation with GPT4, evaluated across five datasets, 100s of tissues & cell types, human and mouse.

A🧵below with my thoughts on how such tools will change how #Bioinformatics is done.
I'll start with a quick summary of the paper, such that we're all on the same page.

(The paper is also a super quick read, literally only 3 pages of text, among which 1 is GPT prompts).

Here's the link to the preprint

biorxiv.org/content/10.110…
The paper looked at 4 already-annotated public datasets: Azimuth, Human Cell Atlas, Human Cell Landscape, Mouse Cell Atlas.

The differentially expressed genes (DEGs) characterizing every cluster in these studies were generally available with the publications & were also downloaded.
In addition to these 4 sets of DEGs, the authors also downloaded a large set of marker genes from the Human Cell Atlas.

GPT4 was prompted (with basic prompts) to identify cell types in these 5 scenarios, given some of the top DEGs in each list.
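To make the setup concrete, here is a minimal sketch of how such a prompt could be assembled from per-cluster gene lists. The function name, the prompt wording, the tissue label and the example marker genes are all illustrative assumptions, not the paper's actual prompts (those are in the preprint). No model is called here, the code only builds the text.

```python
def build_annotation_prompt(tissue, cluster_markers, top_n=10):
    """Build one plain-text prompt asking for a cell type per cluster.

    tissue: free-text tissue name (hypothetical example input).
    cluster_markers: dict mapping cluster id -> ranked list of DEG symbols.
    top_n: how many top genes per cluster to include.
    """
    # Hypothetical wording; the preprint lists the prompts actually used.
    lines = [
        f"Identify cell types of {tissue} cells using the following markers.",
        "Identify one cell type for each row. Only provide the cell type name.",
    ]
    for cluster, genes in cluster_markers.items():
        # Keep only the first top_n genes of each ranked list.
        lines.append(f"{cluster}: {', '.join(genes[:top_n])}")
    return "\n".join(lines)


# Illustrative input: two clusters with well-known canonical markers.
markers = {
    "cluster0": ["CD3D", "CD3E", "IL7R", "CCR7"],
    "cluster1": ["CD14", "LYZ", "S100A8", "S100A9"],
}
prompt = build_annotation_prompt("human PBMC", markers, top_n=3)
print(prompt)
```

The resulting text can be pasted into a chat window or sent through whatever client one prefers.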
Assessment:

“fully match” if the GPT4 & original annotations are the same cell type;

“partially match” if the two annotations are similar, but distinct cell types (e.g. monocyte & macrophage);

“mismatch” if the two annotations refer to different types (e.g. T cell & macrophage).
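The three categories above can be sketched as a small function. This is only an illustration: it assumes a hand-curated list of "similar" cell type groups (the `RELATED` list is hypothetical), and the thread does not say how similarity was actually judged in the paper.

```python
def classify_agreement(predicted, reference, related_groups):
    """Return 'fully match', 'partially match' or 'mismatch'.

    related_groups: list of sets of lowercase cell type names that are
    considered similar but distinct (an assumption of this sketch).
    """
    pred = predicted.strip().lower()
    ref = reference.strip().lower()
    if pred == ref:
        return "fully match"
    # Similar but distinct types share a curated group.
    for group in related_groups:
        if pred in group and ref in group:
            return "partially match"
    return "mismatch"


# Hypothetical curation, using the example pair from the thread.
RELATED = [{"monocyte", "macrophage"}]

print(classify_agreement("T cell", "t cell", RELATED))
print(classify_agreement("monocyte", "macrophage", RELATED))
print(classify_agreement("T cell", "macrophage", RELATED))
```

In practice the hard part is the curation itself, since free-text labels rarely match exactly.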
GPT4's performance in matching the cell type annotation in the original publications is quite impressive.

Of course some tissues are better than others, but given the little effort involved, it certainly is able to provide a good intuition about the analyzed data.
The authors note that GPT4 is in best agreement with the original annotations when using only the top 10 differential genes, and that using more may reduce agreement.

This is an insightful observation, which essentially means that GPT4 is a great "literature summarizer" for what is known.
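For illustration, picking the top genes of one cluster could look like the sketch below. Ranking by log2 fold change is an assumption of this example (the thread does not say how the published DEG lists were ordered), and the numbers are made up.

```python
def top_genes(deg_rows, n=10):
    """Return the n gene symbols with the largest log2 fold change.

    deg_rows: list of (gene, log2_fold_change) tuples for one cluster.
    """
    ranked = sorted(deg_rows, key=lambda row: row[1], reverse=True)
    return [gene for gene, _ in ranked[:n]]


# Made-up fold changes for a monocyte-like cluster.
cluster_degs = [
    ("LYZ", 3.1),
    ("ACTB", 0.2),
    ("CD14", 2.7),
    ("S100A8", 3.4),
]
print(top_genes(cluster_degs, n=3))
```

Keeping `n` small favors the canonical markers, which fits the "literature summarizer" reading above.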
This also means that GPT4 is great for annotations heavily relying on canonical marker genes.

So, if your pipeline involves lots of comparing with literature/googling/extracting expert info, as opposed to just using algorithms on DEGs, then GPT4 can be very helpful.
Another straightforward observation is that GPT4 does best when the cell type is more homogeneous, and worst when it is a heterogeneous mixture (see stroma).

(totally expected)

Still, there are prompting tricks that one can use to increase accuracy.
Now, what does this all mean?

1. We need to understand that cell type assignment is the perfect *type of problem* for ChatGPT to excel at

Why?

Ultimately, most cell assignments boil down to expert knowledge, which is nothing more than mirroring literature in a structured manner
All existing algorithms work by assessing similarities with existing databases/datasets, so in the end we need to choose among multiple ways to do the same thing.

Chatting with GPT4 happens to be a super quick way to get this task done, in a pretty reasonable manner.
2. In the near future, GPT-like models will be able to *reliably* automate most of bioinformatics pipeline running and interpretation.

For many Bioinformaticians and Data Scientists, this is a core part of the job today.
- Writing & debugging code
- Running entire pipelines
- Interpreting results
- Creating reports
- Creating presentations based on results & reports

All these tasks can now be done in a fraction of the time that they used to occupy.

And the efficiency is quickly going up.
This means more time for bioinformaticians to focus on higher-level tasks, for ex:
- understand the underlying biology
- structure & formulate questions
- learn about the reasoning behind the methods used (why is a method better than others, what are the limitations)
Similarly for more advanced algorithms (such as cell type assignment):

instead of getting lost for weeks/months in the weeds of installing/running/debugging such algorithms, chatGPT-like tools will be able to provide reasonable answers in literally 5 minutes.
Of course, such results will likely be an initial attempt to solve the task, requiring further curation/inspection, and sometimes redoing.

But the time saved throughout the process is real, massive, and most importantly, it compounds (time is saved in a cascade of tasks).
3. For people who still think “how can I trust this, billions of parameters, this is not interpretable, this is not perfect”

REMEMBER

*none* of the algorithms out there (incl. expert intuition for e.g. cell assignment) is 100% perfect

perfect doesn’t usually exist in this game
4. None of this is to say that Bioinformaticians won't have an important role in driving future biomed research.

On the contrary, the role of data & models is now more important than ever.

But biomedical science will overall move towards more & more interdisciplinarity.


