🦠🧬🖥️ Excited to share release v1.0 of Bakta: a new tool for rapid & standardized annotations of bacterial genomes & #plasmids github.com/oschwengers/ba… more details 👇 1/n
Bakta annotates bacterial genomes & plasmids in minutes, comparable to Prokka (kudos @torstenseemann), and adds distinct features: annotation of small proteins, alignment-free protein identification, a taxon-independent DB, ample Dbxrefs, JSON output, and use of complete replicons 2/n
In addition to standard feature types, Bakta also detects and annotates known small proteins (#sORF, <30 aa), which are often overlooked b/c they’re not called by gene prediction tools, e.g. Prodigal. Small proteins play vital roles in virulence, stress response, gene regulation, etc. 3/n
Bakta identifies known protein seqs via hash digests and attaches public stable identifiers from #RefSeq & @UniProt (WP_*, UPI*, UniRef100_*) fostering @FAIRsharing_org principles, enabling surveillance of gene alleles, streamlining comparative analysis and post. annotations. 4/n
Currently, Bakta identifies ~185 million distinct UniRef100 protein sequences. Hence, for certain genomes, up to 99% of all CDS can be identified without computationally expensive sequence alignments. 5/n
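A minimal sketch of how alignment-free identification via hash digests can work in principle. The thread does not say which hash function or normalization Bakta uses, so MD5, the uppercase/stop-codon normalization, the toy sequences and the `UniRef100_EXAMPLE*` identifiers are all assumptions made for illustration:

```python
import hashlib

def normalize(seq: str) -> str:
    """Uppercase the sequence and drop whitespace and a trailing stop symbol."""
    return seq.strip().upper().rstrip("*")

def digest(seq: str) -> str:
    """Hex digest of a normalized amino acid sequence (MD5 is an assumption)."""
    return hashlib.md5(normalize(seq).encode("ascii")).hexdigest()

# Hypothetical toy database: digest -> (sequence length, stable identifier).
KNOWN = {}
for ident, seq in [("UniRef100_EXAMPLE1", "MKTAYIAKQR"),
                   ("UniRef100_EXAMPLE2", "MSLNQ")]:
    KNOWN[digest(seq)] = (len(seq), ident)

def identify(seq: str):
    """Return the identifier for an exact sequence match, else None."""
    hit = KNOWN.get(digest(seq))
    if hit is None:
        return None
    length, ident = hit
    # Comparing lengths is a cheap extra guard against digest collisions.
    if length != len(normalize(seq)):
        return None
    return ident
```

An exact match costs one dictionary lookup per protein, so only sequences without a hit need to fall back to an alignment-based search.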
The @UniProt UniRef100 / UniRef90 protein clusters used by Bakta are further enriched with Dbxrefs (EC, GO, COG, RefSeq, UniParc) and pre-annotated via specialized DBs (AMR, IS). 6/n
The complete set of annotated feature types comprises tRNA, tmRNA, rRNA, ncRNA genes, ncRNA cis-regulatory regions, CRISPR arrays, CDS (including sORF), oriC/oriV, oriT and assembly gaps. 7/n
Bakta exports all information as machine-readable JSON, human-readable TSV and standard bioinformatics file formats: GFF3, GenBank, EMBL. The latter are INSDC-compliant & validated by ENA Webin-CLI for genome submissions 8/n
Bakta can handle complete replicons within partial draft assemblies to annotate CDS spanning sequence edges, improving annotations of partially complete genomes 9/n
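To illustrate what a CDS spanning sequence edges means on a complete (circular) replicon, here is a small sketch. It is not Bakta's implementation; the function name `extract_circular`, the coordinate convention and the toy replicon are hypothetical:

```python
def extract_circular(replicon: str, start: int, end: int) -> str:
    """Extract a feature from a circular replicon.

    Coordinates are 1-based and inclusive. If start > end, the feature
    is assumed to wrap around the end of the linear sequence record.
    """
    if start <= end:
        return replicon[start - 1:end]
    # Wrap-around: tail of the record followed by its head.
    return replicon[start - 1:] + replicon[:end]

# Toy replicon whose only ORF (ATG AAA TAG) crosses the sequence edge.
replicon = "AATAGCCCCATGA"
cds = extract_circular(replicon, 10, 5)
print(cds)  # ATGAAATAG
```

If the same record were treated as a linear contig, this CDS would be split across both ends and could not be called as a single gene.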
Bakta is fast! It annotates a bacterial genome in 10 ±5 min on a laptop and plasmids in a few seconds to minutes, requiring only 4 GB of memory. 10/n
Bakta utilizes a taxon-independent, SemVer-versioned DB comprising: AA & DNA sequences, HMM & covariance models and a compact read-only SQLite db storing protein sequence digests, lengths, pre-assigned annotations & dbxrefs, hosted at @ZENODO_ORG doi.org/10.5281/zenodo… 11/n
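A rough sketch of what a lookup against such a SQLite db could look like. The real schema is not described in this thread, so the table name `proteins`, its columns, the use of MD5 and the example row are assumptions for illustration only:

```python
import hashlib
import sqlite3

# In-memory stand-in for the database file (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proteins ("
    "digest TEXT PRIMARY KEY, length INTEGER, product TEXT, dbxrefs TEXT)"
)

def md5_hex(seq: str) -> str:
    """Hex digest of an amino acid sequence (MD5 is an assumption)."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()

# Made-up example entry with placeholder annotation and dbxrefs.
example_seq = "MKTAYIAKQR"
conn.execute(
    "INSERT INTO proteins VALUES (?, ?, ?, ?)",
    (md5_hex(example_seq), len(example_seq),
     "hypothetical example protein", "EC:0.0.0.0,GO:0000000"),
)
conn.commit()

def lookup(connection: sqlite3.Connection, seq: str):
    """Return (product, dbxrefs) for an exact digest and length match, else None."""
    return connection.execute(
        "SELECT product, dbxrefs FROM proteins WHERE digest = ? AND length = ?",
        (md5_hex(seq), len(seq)),
    ).fetchone()
```

Keying the table on the digest means pre-assigned annotations and dbxrefs come back from a single indexed query per protein.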
With this feature set and these runtime characteristics, Bakta aims at a well-balanced tradeoff between fully featured but computationally demanding pipelines (PGAP) and rapid, highly customizable pipelines (Prokka). 12/n
Bakta is new! Hence, bug reports and feedback of any kind are very much appreciated and feature requests are highly welcome! 13/n