Rob Patro · Feb 1 · 24 tweets · 9 min read
@bielleogy k-mers provide a way to compare sequences by directly looking at the composition of the "words" that make them up. A common analogy is to natural language processing and comparing text documents — imagine comparing 2 documents by counting the fraction of words they have in common 1/
@bielleogy There are many ways to measure this, but a common one is the "Jaccard Index", which just counts the number of words (k-mers) in common divided by the total number of distinct words across both. 0 means no common words, 1 means all words are shared. 2/
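A minimal sketch of that ratio in Python, assuming the two word (k-mer) sets are already in hand; the helper name `jaccard` is just illustrative:

```python
def jaccard(words_a, words_b):
    """Shared words divided by total distinct words across both sets."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```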
@bielleogy There are other metrics, but this one is one of the more intuitive ones. So what about k-mer size? Well, unlike natural language, we don't have a simple definition of "words" in biological sequences; no obvious spaces, punctuation, etc. 3/
@bielleogy So, instead, we essentially look at all possible words. This is sometimes called "shingling", but the idea is simple. Take a sliding window, and move it one nucleotide at a time across the sequence. Each window we observe defines a "word". 4/
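A minimal sketch of that sliding window in Python; the helper name `kmers` and the toy sequence are illustrative, not from the thread:

```python
def kmers(seq, k):
    """Every length-k window of seq, collected as a set of 'words'."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# kmers("ACGTAC", 3) -> {"ACG", "CGT", "GTA", "TAC"}
```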
@bielleogy Of course, if we do it this way, words overlap a lot (by all but their first and last characters). So, one decision we have to make is "how long should this window be"? The shorter the window, the more commonality we will find, but the less specific our comparison will be 5/
@bielleogy For example, if we looked at windows of size 3, then basically every nontrivial sequence would contain every possible window (we could consider *frequency* of occurrence, but let's keep this simpler for now). 6/
@bielleogy On the other hand, if we chose windows of size 1000, then sequences may share very few exactly matching words, since only very long stretches shared exactly between them would produce any. So, the k-mer size here controls the tradeoff between sensitivity and specificity. 7/
@bielleogy If k is too small, then many things — even very divergent sequences — will look very similar. If k is too large, then even a small number of mutations could eliminate all exactly matching windows. So, one generally wants to balance these to choose a "good" k. 8/
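To make that tradeoff concrete, here is a small, purely illustrative simulation (random sequences, an arbitrary ~5% substitution rate, and the helpers sketched above; none of the numbers come from the thread):

```python
import random

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mutate(seq, rate, alphabet="ACGT"):
    # substitute each position independently with probability `rate`
    # (some substitutions are silent, since the new base may equal the old one)
    return "".join(random.choice(alphabet) if random.random() < rate else c
                   for c in seq)

random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(10_000))
qry = mutate(ref, rate=0.05)

for k in (3, 11, 21, 31, 101):
    print(k, round(jaccard(kmers(ref, k), kmers(qry, k)), 3))
# tiny k: Jaccard near 1 even between divergent sequences (not specific);
# huge k: Jaccard near 0 after only a few mutations (not sensitive)
```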
@bielleogy Exactly what that is can depend a lot on the application you're using it for. However, typical values for sequence comparison range from the teens up to ~30 or so. 9/
@bielleogy 31 is very common for a completely non-biological reason. It's the largest k-mer you can "pack" into a single "word" on most modern computers (64-bits, requiring 2 bits per nucleotide -> 31-mer). 10/
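A sketch of that packing, assuming the usual 2-bit encoding A=00, C=01, G=10, T=11 (the code is illustrative; Python integers aren't fixed-width, but the arithmetic is the same):

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # 2 bits per base

def pack(kmer):
    """Pack a k-mer into one integer: a 31-mer needs 31 * 2 = 62 bits."""
    word = 0
    for base in kmer:
        word = (word << 2) | ENCODE[base]
    return word

pack("ACGT" * 7 + "ACG")  # a 31-mer, fits comfortably in one 64-bit word
```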
@bielleogy Now at this point, hopefully what a k-mer based comparison like this is doing is somewhat intuitive, but how is it different from alignment, and why would we want to use it instead? 11/
@bielleogy Well, one difference is that alignment typically poses many more constraints. When we count the fraction of shared k-mers, we treat them as a "bag of words"; we don't care, e.g., about the order in which they appear. 12/
@bielleogy On the other hand, in alignment, we typically enforce that matches between nucleotides are "co-linear" — that is, if we match nucleotide i on the first sequence to nucleotide j on the second sequence, then we can't match i' > i on the first sequence to j' < j on the second 13/
@bielleogy Another way of saying this is that the lines we see in e.g. a BLAST alignment don't cross. This is a structural constraint on what we expect from a typical alignment. There are other differences too — with alignment we can develop very sophisticated scoring functions. 14/
@bielleogy For example, in protein space, maybe substituting a hydrophobic amino acid with another hydrophobic one in an alignment costs us "less" than substituting with a hydrophilic one. Also, we can craft the costs of gaps in the alignment to match biological intuition. 15/
@bielleogy Of course, ensuring that intuition is reliable and generalizable is another problem entirely. So quickly, back to the k-mers. If they give up this extra structure and flexibility, why do a comparison like this? 16/
@bielleogy There are *many* reasons, argued in great depth in the literature, but here are 2. (1) It can be *much* faster to measure something like the Jaccard Index than to find the optimal alignments between sequences; esp. when many sequences are involved. 17/
@bielleogy So these approaches are generally much more computationally efficient and much faster! (2) Because it doesn't impose all of the same kinds of constraints as alignments, it can sometimes be more sensitive, and capture similarity that certain kinds of alignments would miss. 18/
@bielleogy For example, the fraction of shared k-mers may not be affected much by an inversion, but an alignment score would typically be affected (which is why special considerations must be made for alignments to contain such things). 19/
@bielleogy Finally, it's worth noting that, at a very large scale, even computing something like the Jaccard Index can be prohibitive, if we want to compute it between many different things (e.g. genomes). That's where tools like Mash (genomebiology.biomedcentral.com/articles/10.11…) come in. 20/
@bielleogy It turns out that the Jaccard Index can be very efficiently approximated — with an error that we can understand under certain assumptions — by sampling in an intelligent way. This allows pairwise comparison of the similarity within *huge* sequence collections. 21/
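The sampling idea here is MinHash; the snippet below is a stripped-down "bottom-s" sketch for illustration only (it is not Mash's implementation, and the hash function and sketch size are arbitrary choices):

```python
import hashlib

def minhash_sketch(kmer_set, s=1000):
    """Keep the s smallest 64-bit hash values of the k-mer set."""
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(k.encode(), digest_size=8).digest(), "big")
        for k in kmer_set
    )
    return set(hashes[:s])

def approx_jaccard(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard Index from the s smallest hashes of the merged sketches."""
    merged = sorted(sketch_a | sketch_b)[:s]
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged) if merged else 0.0
```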
@bielleogy Finally, depending on the context, the choice need not be either-or. Once you know an approximate similarity, you can tell which pairs of sequences *definitely* won't have a good alignment, and only perform costly alignment on the pairs that might. 22/
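As a hypothetical pre-filter along those lines, reusing the helpers sketched above (the function name, parameters, and threshold are all illustrative):

```python
def candidate_pairs(named_seqs, k=21, s=1000, min_jaccard=0.1):
    """Yield only those pairs similar enough to be worth aligning."""
    sketches = {name: minhash_sketch(kmers(seq, k), s)
                for name, seq in named_seqs.items()}
    names = list(named_seqs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if approx_jaccard(sketches[a], sketches[b], s) >= min_jaccard:
                yield a, b  # everything else skips the costly alignment
```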
@bielleogy So that's my initial 22-tweet description here, meant to be very general given the framing of your question. However, I'd be happy to discuss more if you have any follow-up questions!
@bielleogy I should also mention here for context – why 31, why not 64/2 -> 32? Well, it's common to prefer *odd* k-mers because this ensures that a k-mer cannot be its own reverse complement. So 31 is the largest *odd* k-mer size that can be squeezed into a machine word.
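A quick illustration of why odd k sidesteps that issue (standard complement table; the example k-mers are toy ones):

```python
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(kmer):
    return "".join(COMP[b] for b in reversed(kmer))

revcomp("ACGT") == "ACGT"  # True: an even-length k-mer can be its own revcomp
revcomp("ACG") == "ACG"    # False: with odd k the middle base would have to
                           # equal its own complement, which is impossible
```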