Malfoy Profile picture
5 Nov, 11 tweets, 3 min read
So excited to share our latest preprint!
TLDR:
We present NIQKI, a Mash-like tool able to index large sequence databases.
Its main feature is to be fast!
We were able to index all bacterial genomes of GenBank (>1million) and query all pairwise distances in a couple of days(!)
Behind the scenes, we rely on a novel index structure to query partition-based fingerprints that we call NIQKI (Next Index to Query Kmer Intersection) which use an inverted index for each partition
Concretely, we associate to each <partition/fingerprint_value>pair the list of genomes that got this fingerprint value in this partition.
This means that a query will only generate one costly "Random Access" per partition.
In practice, it means really fast queries, that do not depend that much on the size of the database.
Here Mash is red Dashing green and NIQKI is blue.
But this structure came at a memory cost because we have a memory overhead exponential with the size of the indexed fingerprint #ouch
But since our Index can index any kind of fingerprint we were free to try several approaches
NIQKI Minhash
NIQKI Hyperloglog
NIQKI Hyperminhash and so on...
Doing so we studied how to build the smallest fingerprint with the lowest possible false-positive rate.
To do so we proposed a generalization of Hyperminhash that cover Minhash, Hyperloglog and Hyperminhash cases
In theory, our (h,m)-HMH fingerprint can be tuned to provide the lowest false positive rate if the subsampling factor of your sketch can be roughly estimated (Genome_Size/sketch_size).
The good news is it can be easily estimated and works well in practice!
So our NIQKI index uses such tuned Hyperminhash fingerprint and can deliver precise results with reasonable memory usage.
Here we plot the correlation between our estimation with Dashing using 1million fingerprints on 1000 Refseq genomes
We really believe that such an index will unlock a plethora of existing applications as we can basically compare a genome/MAG/contig to ALL GenBank bacterial genome within a second
We will continue to work on the software as it can still be improved in MANY ways. Comments, suggestions, and feedback on the tool/source/manuscript are welcomed!
This @biorxiv_bioinfo is the manuscript submitted to @recomb2022 #fingercrossed

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Malfoy

Malfoy Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Follow Us on Twitter!

:(