So excited to share our latest preprint!
TLDR:
We present NIQKI, a Mash-like tool able to index large sequence databases.
Its main feature is to be fast!
We were able to index all bacterial genomes of GenBank (>1million) and query all pairwise distances in a couple of days(!)
Behind the scenes, we rely on a novel index structure to query partition-based fingerprints that we call NIQKI (Next Index to Query Kmer Intersection) which use an inverted index for each partition
Concretely, we associate to each <partition/fingerprint_value>pair the list of genomes that got this fingerprint value in this partition.
This means that a query will only generate one costly "Random Access" per partition.
In practice, it means really fast queries, that do not depend that much on the size of the database.
Here Mash is red Dashing green and NIQKI is blue.
But this structure came at a memory cost because we have a memory overhead exponential with the size of the indexed fingerprint #ouch
But since our Index can index any kind of fingerprint we were free to try several approaches
NIQKI Minhash
NIQKI Hyperloglog
NIQKI Hyperminhash and so on...
Doing so we studied how to build the smallest fingerprint with the lowest possible false-positive rate.
To do so we proposed a generalization of Hyperminhash that cover Minhash, Hyperloglog and Hyperminhash cases
In theory, our (h,m)-HMH fingerprint can be tuned to provide the lowest false positive rate if the subsampling factor of your sketch can be roughly estimated (Genome_Size/sketch_size).
The good news is it can be easily estimated and works well in practice!
So our NIQKI index uses such tuned Hyperminhash fingerprint and can deliver precise results with reasonable memory usage.
Here we plot the correlation between our estimation with Dashing using 1million fingerprints on 1000 Refseq genomes
We really believe that such an index will unlock a plethora of existing applications as we can basically compare a genome/MAG/contig to ALL GenBank bacterial genome within a second
We will continue to work on the software as it can still be improved in MANY ways. Comments, suggestions, and feedback on the tool/source/manuscript are welcomed!
This @biorxiv_bioinfo is the manuscript submitted to @recomb2022#fingercrossed
• • •
Missing some Tweet in this thread? You can try to
force a refresh