Rob Patro · Feb 1 · 24 tweets · 9 min read
@bielleogy k-mers provide a way to compare sequences by directly looking at the composition of the "words" that make them up. A common analogy is to natural language processing and comparing text documents — imagine comparing 2 documents by counting the fraction of words they have in common 1/
@bielleogy There are many ways to measure this, but a common one is the "Jaccard Index", which just counts the number of words (k-mers) in common divided by the total number of distinct words across both. 0 means no common words, 1 means all words are shared. 2/
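A minimal sketch of that ratio in Python, assuming the two word (k-mer) sets are already in hand; the helper name `jaccard` is just illustrative:

```python
def jaccard(words_a, words_b):
    """Shared words divided by total distinct words across both sets."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```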
@bielleogy There are other metrics, but this one is one of the more intuitive ones. So what about k-mer size? Well, unlike natural language, we don't have a simple definition of "words" in biological sequences; no obvious spaces, punctuation, etc. 3/
@bielleogy So, instead, we essentially look at all possible words. This is sometimes called "shingling", but the idea is simple. Take a sliding window, and move it one nucleotide at a time across the sequence. Each window we observe defines a "word". 4/
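A minimal sketch of that sliding window in Python; the helper name `kmers` and the toy sequence are illustrative, not from the thread:

```python
def kmers(seq, k):
    """Every length-k window of seq, collected as a set of 'words'."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# kmers("ACGTAC", 3) -> {"ACG", "CGT", "GTA", "TAC"}
```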
@bielleogy Of course, if we do it this way, words overlap a lot (by all but their first and last characters). So, one decision we have to make is "how long should this window be"? The shorter the window, the more commonality we will find, but the less specific our comparison will be 5/
@bielleogy For example, if we looked at windows of size 3, then basically every nontrivial sequence would contain every possible window (we could consider *frequency* of occurrence, but let's keep this simpler for now). 6/
@bielleogy On the other hand, if we chose windows of size 1000, then sequences may share very few exactly matching words, since only very long stretches shared exactly between them would produce any. So, the k-mer size here controls the tradeoff between sensitivity and specificity. 7/
@bielleogy If k is too small, then many things — even very divergent sequences — will look very similar. If k is too large, then even a small number of mutations could eliminate all exactly matching windows. So, one generally wants to balance these to choose a "good" k. 8/
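To make that tradeoff concrete, here is a small, purely illustrative simulation (random sequences, an arbitrary ~5% substitution rate, and the helpers sketched above; none of the numbers come from the thread):

```python
import random

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mutate(seq, rate, alphabet="ACGT"):
    # substitute each position independently with probability `rate`
    # (some substitutions are silent, since the new base may equal the old one)
    return "".join(random.choice(alphabet) if random.random() < rate else c
                   for c in seq)

random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(10_000))
qry = mutate(ref, rate=0.05)

for k in (3, 11, 21, 31, 101):
    print(k, round(jaccard(kmers(ref, k), kmers(qry, k)), 3))
# tiny k: Jaccard near 1 even between divergent sequences (not specific);
# huge k: Jaccard near 0 after only a few mutations (not sensitive)
```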
@bielleogy Exactly what that is can depend a lot on the application you're using it for. However, typical values for sequence comparison range from the teens up to ~30 or so. 9/
@bielleogy 31 is very common for a completely non-biological reason. It's the largest k-mer you can "pack" into a single "word" on most modern computers (64-bits, requiring 2 bits per nucleotide -> 31-mer). 10/
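A sketch of that packing, assuming the usual 2-bit encoding A=00, C=01, G=10, T=11 (the code is illustrative; Python integers aren't fixed-width, but the arithmetic is the same):

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # 2 bits per base

def pack(kmer):
    """Pack a k-mer into one integer: a 31-mer needs 31 * 2 = 62 bits."""
    word = 0
    for base in kmer:
        word = (word << 2) | ENCODE[base]
    return word

pack("ACGT" * 7 + "ACG")  # a 31-mer, fits comfortably in one 64-bit word
```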
@bielleogy Now at this point, hopefully what a k-mer based comparison like this is doing is somewhat intuitive, but how is it different from alignment, and why would we want to use it instead? 11/
@bielleogy Well, one difference is that alignment typically poses many more constraints. When we count the fraction of shared k-mers, we treat them as a "bag of words"; we don't care, e.g., about the order in which they appear. 12/
@bielleogy On the other hand, in alignment, we typically enforce that matches between nucleotides are "co-linear" — that is, if we match nucleotide i on the first sequence to nucleotide j on the second sequence, then we can't match i' > i on the first sequence to j' < j on the second 13/
@bielleogy Another way of saying this is that the lines we see in e.g. a BLAST alignment don't cross. This is a structural constraint on what we expect from a typical alignment. There are other differences too — with alignment we can develop very sophisticated scoring functions. 14/
@bielleogy For example, in protein space, maybe substituting a hydrophobic amino acid with another hydrophobic one in an alignment costs us "less" than substituting with a hydrophilic one. Also, we can craft the costs of gaps in the alignment to match biological intuition. 15/
@bielleogy Of course, ensuring that intuition is reliable and generalizable is another problem entirely. So quickly, back to the k-mers. If they give up this extra structure and flexibility, why do a comparison like this? 16/
@bielleogy There are *many* reasons, argued in great depth in the literature, but here are 2. (1) It can be *much* faster to measure something like the Jaccard Index than to find the optimal alignments between sequences; esp. when many sequences are involved. 17/
@bielleogy So these approaches are generally much more computationally efficient and much faster! (2) Because it doesn't impose all of the same kinds of constraints as alignments, it can sometimes be more sensitive, and capture similarity that certain kinds of alignments would miss. 18/
@bielleogy For example, the fraction of shared k-mers may not be affected much by an inversion, but an alignment score would typically be affected (which is why special considerations must be made for alignments to contain such things). 19/
@bielleogy Finally, it's worth noting that, at a very large scale, even computing something like the Jaccard Index can be prohibitive, if we want to compute it between many different things (e.g. genomes). That's where tools like Mash (genomebiology.biomedcentral.com/articles/10.11…) come in. 20/
@bielleogy It turns out that the Jaccard Index can be very efficiently approximated — with an error that we can understand under certain assumptions — by sampling in an intelligent way. This allows pairwise comparison of the similarity within *huge* sequence collections. 21/
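The sampling idea here is MinHash; the snippet below is a stripped-down "bottom-s" sketch for illustration only (it is not Mash's implementation, and the hash function and sketch size are arbitrary choices):

```python
import hashlib

def minhash_sketch(kmer_set, s=1000):
    """Keep the s smallest 64-bit hash values of the k-mer set."""
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(k.encode(), digest_size=8).digest(), "big")
        for k in kmer_set
    )
    return set(hashes[:s])

def approx_jaccard(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard Index from the s smallest hashes of the merged sketches."""
    merged = sorted(sketch_a | sketch_b)[:s]
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged) if merged else 0.0
```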
@bielleogy Finally, depending on the context, the choice need not be either-or. Once you know an approximate similarity, you can tell which pairs of sequences *definitely* won't have a good alignment, and only perform costly alignment on the pairs that might. 22/
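As a hypothetical pre-filter along those lines, reusing the helpers sketched above (the function name, parameters, and threshold are all illustrative):

```python
def candidate_pairs(named_seqs, k=21, s=1000, min_jaccard=0.1):
    """Yield only those pairs similar enough to be worth aligning."""
    sketches = {name: minhash_sketch(kmers(seq, k), s)
                for name, seq in named_seqs.items()}
    names = list(named_seqs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if approx_jaccard(sketches[a], sketches[b], s) >= min_jaccard:
                yield a, b  # everything else skips the costly alignment
```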
@bielleogy So that's my initial 22-tweet description here, meant to be very general given the framing of your question. However, I'd be happy to discuss more if you have any follow-up questions!
@bielleogy I should also mention here for context – why 31, why not 64/2 -> 32? Well, it's common to prefer *odd* k-mers because this ensures that a k-mer cannot be its own reverse complement. So 31 is the largest *odd* k-mer size that can be squeezed into a machine word.
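A quick illustration of why odd k sidesteps that issue (standard complement table; the example k-mers are toy ones):

```python
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(kmer):
    return "".join(COMP[b] for b in reversed(kmer))

revcomp("ACGT") == "ACGT"  # True: an even-length k-mer can be its own revcomp
revcomp("ACG") == "ACG"    # False: with odd k the middle base would have to
                           # equal its own complement, which is impossible
```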