Rob Patro

Jan 6 • 20 tweets • 8 min read

Are you interested in performing splice-aware quantification of your #scrnaseq data, obtaining unspliced, spliced, and ambiguous UMI counts quickly & in <3GB of RAM? If so, check out the new manuscript by @DongzeHe, @CSoneson and me on #bioRxiv bit.ly/3vJr0Ji. 1/🧵

Understanding the origin of sequencing reads — the molecules from which they arise, the "gene" with which those molecules are associated, and the splicing status of those molecules — is a key task in single-cell RNA-seq quantification.

The short, (effectively) single-end nature of the reads used in popular technologies leads to situations in which it can be difficult or impossible to predict if a read was sequenced from an unspliced (nascent) or spliced (mature) RNA; these reads are designated as "ambiguous."

It turns out, however, that there are multiple ways in which one can define "ambiguous" reads. Prevailing convention considers UMIs associated with reads arising purely from exons as "spliced", so long as there is no contradictory evidence.

An alternative approach is to be more conservative, and to consider reads arising from within exons to simply be "ambiguous". This is the proposal made in a recent preprint (bit.ly/3GjCPuI), and the motivation is intuitive. But what is the effect of this choice?

One immediate effect is that a large fraction of reads in a typical sample now fall into this ambiguous category (see our Table 1). In fact, these numbers are ~5x higher than if one adopts the "traditional" allocation rules.

While this convention can avoid classifying exonic reads arising from unspliced molecules as spliced, it also doesn't recognize that exonic reads are not equally likely to arise from spliced and unspliced molecules (Fig. 2; caveat, from a vastly simplified simulation).

Since assignments made by alevin-fry are based on how the reference transcriptome is constructed, simply changing from a splici reference to a spliceu reference (spliced transcripts + one nascent transcript per gene) results, immediately, in the more conservative assignment rule.
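A toy sketch of why the reference alone changes the rule. It assumes a read's status is decided only by which kinds of reference sequence can contain it, and that splici means spliced transcripts plus intronic sequences extended by a small flank. The gene layout, coordinates, flank size, and function names are all made up for illustration; this is not alevin-fry's implementation, and junction-spanning reads are ignored.

```python
# Hypothetical gene: two exons separated by one intron (toy coordinates).
EXONS = [(0, 100), (200, 300)]
INTRON = (100, 200)
GENE = (0, 300)
FLANK = 20  # assumed flank added around the intron in the splici-style reference


def inside(read, target):
    """True if the read interval lies entirely within the target interval."""
    return target[0] <= read[0] and read[1] <= target[1]


def unspliced_targets(reference):
    """Unspliced sequences present in each (toy) reference."""
    if reference == "splici":
        # spliced transcripts + flanked intronic sequence
        return [(INTRON[0] - FLANK, INTRON[1] + FLANK)]
    if reference == "spliceu":
        # spliced transcripts + one nascent (full-length) transcript per gene
        return [GENE]
    raise ValueError(f"unknown reference: {reference}")


def classify(read, reference):
    """Assign a splicing status from the kinds of sequence the read fits in."""
    hits_spliced = any(inside(read, exon) for exon in EXONS)
    hits_unspliced = any(inside(read, t) for t in unspliced_targets(reference))
    if hits_spliced and hits_unspliced:
        return "ambiguous"
    if hits_spliced:
        return "spliced"
    if hits_unspliced:
        return "unspliced"
    return "unmapped"
```

In this toy, a read lying well inside an exon, e.g. `classify((10, 60), "splici")`, comes out as "spliced", while the same read against "spliceu" comes out as "ambiguous", because the nascent transcript also contains every exon. A purely intronic read is "unspliced" under both.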

Yet, most downstream analyses have no specific mechanism for dealing with ambiguous reads. So, these reads must either be discarded, or assigned some definitive status. Discarding them would throw away too much information, so a definitive “classification” is typically made.

It turns out that if you choose to classify all ambiguous reads as spliced in scRNA-seq, or unspliced in snRNA-seq (a choice certainly worthy of further investigation), then the more conservative assignment rules produce results that are quite similar to the existing conventions.
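The collapsing step described above could look roughly like this. It is a minimal sketch assuming per-gene UMI counts are already split into spliced (S), unspliced (U), and ambiguous (A); the dictionary layout, the `collapse` function, and the example counts are hypothetical, not the output format of any particular tool.

```python
def collapse(counts, assay):
    """Fold ambiguous UMI counts into a definitive category.

    counts: dict mapping gene -> {"S": int, "U": int, "A": int}
    assay:  "scRNA-seq" (ambiguous treated as spliced) or
            "snRNA-seq" (ambiguous treated as unspliced)
    """
    if assay not in ("scRNA-seq", "snRNA-seq"):
        raise ValueError(f"unknown assay: {assay}")
    collapsed = {}
    for gene, c in counts.items():
        if assay == "scRNA-seq":
            collapsed[gene] = {"spliced": c["S"] + c["A"], "unspliced": c["U"]}
        else:
            collapsed[gene] = {"spliced": c["S"], "unspliced": c["U"] + c["A"]}
    return collapsed


# Made-up counts for one gene in one cell.
example = {"geneX": {"S": 4, "U": 3, "A": 20}}
single_cell = collapse(example, "scRNA-seq")
single_nucleus = collapse(example, "snRNA-seq")
```

Because the three categories are kept separate until this point, the choice of where the ambiguous counts go stays with the downstream analysis rather than being fixed during quantification.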

There are differences (and methods adopting the same classification rules tend to be more concordant, as one would expect), but broadly, the resulting quantifications are quite similar.

Unfortunately, we were not able to straightforwardly reproduce results consistent with previous claims to the contrary, but that may have been due to an annotation error (see Section 2.2 of the manuscript).

So, how best to assign splicing status still seems to be far from a solved problem! Compared to the relatively simple assignment rules currently used, one enticing direction is to build a more informed probabilistic model that incorporates additional evidence.

One valuable piece of evidence that is not currently used is the location of the likely priming site of the "technical read". This could be incorporated into the evaluation of splicing status. We make some suggestions to this effect in the paper.

Given that important aspects of this question remain to be resolved, we contend that quantification methods should categorize the splicing status of UMIs according to a clearly-defined set of rules & propagate all UMIs, tagged with their splicing status, to downstream tools.

Rather than deciding what to count and what to discard during mapping and quantification, the relevant information can be propagated to the count matrix so that downstream tools can make informed decisions based on their models and use cases.

Of course, this requires mapping & quantification be performed against a single index containing both spliced & unspliced (nascent) transcripts. To this end, we suggest the piscem index (introduced at #biodata22), which can map against this enhanced transcriptome in <3GB of RAM!

Piscem github.com/COMBINE-lab/pi… can be used directly upstream of alevin-fry as a replacement for the salmon mapper (which uses the pufferfish index). It unlocks many opportunities for indexing large sequence collections beyond scRNA-seq; stay tuned for a more thorough description.

In conclusion, this topic definitely deserves more attention (dare I say "more new algorithms")! In the meantime, it may be best to maximize the information you keep. Luckily, that's easy to do! 19/19

Link back to the top of the thread: https://twitter.com/nomad421/status/1611352082686025728?s=20&t=1vsn6i_4-XeNFKCYVoSBvg 20/19

• • •

More from @nomad421

Rob Patro

@nomad421

Jul 28, 2022

I've been writing some small tools in rust for an (exciting) upcoming project. A few thoughts on this experience & what makes it so enjoyable compared to the (fast, compiled, statically-typed) alternatives! This is mostly about the tooling, let alone how great the lang is🧵

Getting a project started is *trivial*. All of the "boilerplate" is generated automatically by `cargo init`, I don't have to worry about how to set it up because there is "a way". 2/

Adding dependencies is also *trivial* and it just works. I need an argument parser — no need to track down a header only library or add some "submodule" to my CMake build — just add the line in cargo and it's done! 3/

Read 15 tweets

Rob Patro

@nomad421

Feb 1, 2022

@bielleogy k-mers provide a way to compare sequences by directly looking at the composition of the "words" that make them up. A common analogy is to natural language processing and comparing text documents — imagine comparing 2 documents by counting the frac of words they have in common 1/

@bielleogy There are many ways to measure this, but some common ones are metrics like the "Jaccard Index", which just counts the number of words (k-mers) in common divided by the total number of distinct words. 0 means no common words, 1 means all words are shared. 2/

@bielleogy There are other metrics but this one is one of the more intuitive ones. So what about k-mer size? Well, unlike natural language, we don't have a simple definition of "words" in biological sequence; no obvious spaces, punctuation, etc. 3/

Read 24 tweets

Rob Patro

@nomad421

Dec 7, 2021

@gunesaynasinda In my experience, research is a near constant roller coaster like this. There are periods of huge productivity and right after they are over you often look over your shoulder wondering why the wave didn't continue unabated, and feel like your productivity has "slipped". 1/x

@gunesaynasinda But in reality, there are constant ups and downs, and the long-run average is only visible in a time-frame that's much larger than the gap between conference deadlines. Likewise, we often tend to look only at our peers that are the most productive at the current moment. 2/x

@gunesaynasinda Not paying attention when their previous wave peaks & returning attention again only when their next wave arrives. This leads to a skewed perspective of how "our" research is going in the larger context; everyone's bobbing up & down but we preferentially see those who are up. 3/x

Read 5 tweets

Rob Patro

@nomad421

Apr 8, 2020

RNA-seq data is often analyzed at the level of genes. This can provide a robust signal, but can also miss out on biologically important information like differences in isoform composition or dominant isoform usage. 1/n

On the other hand, tremendous progress has been made in transcript-level quantification, but certain inherent ambiguity can remain in the abundance estimates. This results from patterns of multi-mapping where no inference procedure can accurately resolve the origin of reads. 2/n

Yet, the total transcriptional output of a group of transcripts sharing these complex multi-mapping patterns will have greatly-reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. 3/n

Read 10 tweets
