Can't wait to share the KMCP preprint with you, hope it's not too late. Thanks for reading and any feedback is welcome and appreciated. biorxiv.org/content/10.110…
We present a novel metagenomic profiling tool, KMCP, which not only allows for accurate taxonomic profiling of archaeal, bacterial, and viral populations from metagenomic shotgun sequence data, but also provides confident pathogen detection for clinical samples of low depth.
We reimplement a modified COBS data structure to index reference genomes. The KMCP index is smaller than that of COBS, and batch searching is nearly 10 times faster. Database construction is also fast, taking only 25 min to index 47,894 genomes from GTDB.
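For readers new to COBS-style indexes, here is a toy sketch of the general bit-sliced signature idea. It is not KMCP's modified structure. Each reference gets a small Bloom filter, and the filters are stored row by row, so one k-mer lookup ANDs a few rows to get all candidate references at once. The names `positions`, `build`, and `query`, the SHA-256 hashing, and the sizes `m` and `n_hashes` are all assumptions made for illustration.

```python
import hashlib

def positions(kmer, m, n_hashes):
    """Hash a k-mer to n_hashes bit positions in a filter of m bits (illustrative hashing)."""
    out = []
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out

def build(refs, k, m=64, n_hashes=2):
    """Build a bit-sliced index: rows[p] has bit j set if reference j's filter has bit p set."""
    names = sorted(refs)
    rows = [0] * m
    for j, name in enumerate(names):
        seq = refs[name]
        for i in range(len(seq) - k + 1):
            for p in positions(seq[i:i + k], m, n_hashes):
                rows[p] |= 1 << j
    return names, rows

def query(names, rows, kmer, n_hashes=2):
    """Return references that may contain the k-mer (false positives possible, no false negatives)."""
    mask = (1 << len(names)) - 1
    for p in positions(kmer, len(rows), n_hashes):
        mask &= rows[p]  # a reference survives only if all of its hashed bits are set
    return {names[j] for j in range(len(names)) if (mask >> j) & 1}

names, rows = build({"g1": "AAAACCCC", "g2": "CCCCGGGG"}, k=3)
print(query(names, rows, "CCC"))  # contains at least g1 and g2
```

Like any Bloom filter, this can report a reference that does not contain the k-mer, which is one reason extra filtering of search results is useful.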
To reduce the high false positives in traditional k-mer based methods, we introduce genomic positions to k-mers by splitting the reference genomes into chunks. We track both the origin and the approximate location in the reference for each query, denoted as pseudo-mapping.
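A minimal sketch of the chunking idea, assuming exact k-mer sets instead of the real index. The function names, the `k - 1` overlap between adjacent chunks, and the query coverage definition (fraction of the read's k-mers found in a chunk) are illustrative choices, not KMCP's actual code.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def split_into_chunks(genome, n_chunks, k):
    """Split a genome into n_chunks pieces; adjacent pieces overlap by k-1 bases
    so that k-mers spanning a boundary are not lost."""
    size = -(-len(genome) // n_chunks)  # ceiling division
    return [genome[i * size:min(len(genome), (i + 1) * size + k - 1)]
            for i in range(n_chunks)]

def build_index(genomes, n_chunks, k):
    """Map (genome name, chunk index) to the k-mer set of that chunk."""
    index = {}
    for name, seq in genomes.items():
        for idx, chunk in enumerate(split_into_chunks(seq, n_chunks, k)):
            index[(name, idx)] = kmers(chunk, k)
    return index

def search(index, read, k, min_cov):
    """Return (genome, chunk index, k-mer coverage) for chunks sharing enough k-mers with the read."""
    q = kmers(read, k)
    hits = []
    for (name, idx), ks in index.items():
        cov = len(q & ks) / len(q)
        if cov >= min_cov:
            hits.append((name, idx, cov))
    return sorted(hits, key=lambda h: -h[2])

index = build_index({"A": "AAAACCCCGGGGTTTT"}, n_chunks=4, k=3)
print(search(index, "CCCGG", k=3, min_cov=0.5))  # [('A', 1, 1.0)]
```

The chunk index in each hit is the approximate location: the read is placed in the second quarter of genome A without any base-level alignment.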
KMCP utilizes genome coverage (chunks fraction) to filter out spurious references, i.e., the reference genome chunks need to be uniformly covered by reads. Meanwhile, we require a matched genome to have some high-confidence uniquely matched reads with high k-mer coverage.
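The two filters can be sketched roughly as follows, assuming per-read hits of the form (genome, chunk index, k-mer coverage). The function name `filter_references` and the thresholds are hypothetical values picked for the example, not the tool's real parameters.

```python
from collections import defaultdict

def filter_references(read_hits, n_chunks, min_chunks_fraction=0.8,
                      min_uniq_reads=1, high_conf_cov=0.75):
    """read_hits maps read id -> list of (genome, chunk index, k-mer coverage).
    Keep genomes whose chunks are broadly covered and that have enough
    high-confidence reads matching no other genome."""
    chunks_seen = defaultdict(set)
    uniq_reads = defaultdict(int)
    for hits in read_hits.values():
        for genome, chunk, _ in hits:
            chunks_seen[genome].add(chunk)
        genomes = {g for g, _, _ in hits}
        # A read counts as unique if all of its hits point to a single genome.
        if len(genomes) == 1 and max(c for _, _, c in hits) >= high_conf_cov:
            uniq_reads[next(iter(genomes))] += 1
    kept = {}
    for genome, chunks in chunks_seen.items():
        fraction = len(chunks) / n_chunks  # the chunks fraction
        if fraction >= min_chunks_fraction and uniq_reads[genome] >= min_uniq_reads:
            kept[genome] = fraction
    return kept

read_hits = {
    "r1": [("A", 0, 1.0)],
    "r2": [("A", 1, 0.9)],
    "r3": [("A", 2, 0.8), ("B", 0, 0.8)],  # ambiguous read, hits both genomes
    "r4": [("A", 3, 1.0)],
}
print(filter_references(read_hits, n_chunks=4))  # {'A': 1.0}
```

Genome B is dropped here: only one of its four chunks is hit, and its single read is shared with A.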
Benchmarks on CAMI2 mouse gut data and Sun's simulated datasets showed that KMCP achieves accuracy close to that of MetaPhlAn3 and mOTUs3. Although mOTUs3 is accurate on prokaryotic metagenome datasets, it cannot detect viruses.
Benchmarks on a mock virome dataset showed that KMCP outperformed Centrifuge, Bracken, and MetaPhlAn3, with high accuracy in both taxon identification and abundance estimation.
In pathogen detection, KMCP had an accuracy close to that of Kraken2, while generating shorter prediction lists and ranking pathogens near the top by considering both sequence similarity and the genome chunks fraction, which helps researchers rapidly interpret the reports.
We provide 3 pre-built databases for metagenomic profiling: the prokaryotic database created with 47,894 archaeal and bacterial genomes from GTDB, the viral database with 27,936 virus genomes from GenBank, and the fungal database with 403 fungal genomes from RefSeq.
With the same reference genomes, KMCP generates much smaller databases (66 GB) than Bracken (304 GB) and Centrifuge (97 GB), and requires less (<60 GB) memory for searching than Kraken2/Bracken (250 GB) and Centrifuge (97 GB).
The mergeability of search results makes it flexible to build databases for different reference datasets and to choose which databases to search. To update databases, users can either rebuild the database with newly added genomes or build only an additional DB for the new genomes.
KMCP's searching speed is much slower than that of Bracken, Centrifuge, MetaPhlAn3, and mOTUs3. Fortunately, KMCP can utilize computer clusters to linearly accelerate read searching, with each node searching against a small database built with a partition of the reference genomes.
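A rough sketch of why partitioned searching works, assuming exact k-mer sets and a simple coverage threshold. The helpers `partition`, `search`, and `merge` are hypothetical stand-ins rather than KMCP commands. Since a search hit carries no taxonomic label, hits from separate databases can simply be concatenated.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_db(genomes, k):
    """Toy database: genome name -> k-mer set."""
    return {name: kmers(seq, k) for name, seq in genomes.items()}

def search(db, read, k, min_cov):
    """Return (genome, k-mer coverage) for every genome passing the threshold."""
    q = kmers(read, k)
    hits = []
    for name, ref in db.items():
        cov = len(q & ref) / len(q)
        if cov >= min_cov:
            hits.append((name, cov))
    return hits

def partition(genomes, n_parts):
    """Split the reference genomes into n_parts groups, one small database per node."""
    names = sorted(genomes)
    return [{n: genomes[n] for n in names[i::n_parts]} for i in range(n_parts)]

def merge(results):
    """Concatenate hits from several databases and sort by coverage, then name."""
    merged = [h for hits in results for h in hits]
    return sorted(merged, key=lambda h: (-h[1], h[0]))

genomes = {"g1": "AAAACCCC", "g2": "CCCCGGGG", "g3": "GGGGTTTT"}
read, k = "CCCCG", 3

# Each partition could be searched on a different node; here they run in a loop.
per_node = [search(build_db(part, k), read, k, 0.5) for part in partition(genomes, 2)]
print(merge(per_node))  # [('g2', 1.0), ('g1', 0.5)]
```

In this toy setup the merged result equals a search against one database holding all three genomes, which is the property that lets each node work independently.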
KMCP is implemented in #golang, with a single executable binary file available, and the source code is freely available under the MIT License. The latest version of KMCP can be obtained from the Bioconda channel of Conda package manager and github.com/shenwei356/kmcp
🎉KMCP is out in Bioinformatics. In the paper we present a novel k-mer-based metagenomic profiling tool that combines k-mer similarity and genome coverage information to increase the profiling accuracy. academic.oup.com/bioinformatics…
KMCP splits the ref genomes into chunks and stores k-mers in an optimized COBS index for fast alignment-free sequence searching. Like quasi-mapping in RapMap, we track the target and position for each query. However, the read position is approximate and at a predefined resolution.
Unlike LCA-based methods, which retrieve the TaxId of each k-mer in the read and assign the LCA of the resulting TaxIds to the read, KMCP's searching step does not assign any taxonomic label to the queries; therefore, search results from multiple databases can be merged.