🦠🧬🖥️ Excited to share release v1.0 of Bakta: a new tool for rapid & standardized annotation of bacterial genomes & #plasmids github.com/oschwengers/ba… More details 👇 1/n
Bakta annotates bacterial genomes & plasmids in minutes, comparable to Prokka (kudos @torstenseemann), and adds distinct features: annotation of small proteins, alignment-free protein identification, a taxon-independent DB, ample Dbxrefs, JSON output, and use of complete replicons 2/n
In addition to standard feature types, Bakta also detects and annotates known small proteins (#sORF, <30 aa), which are often overlooked b/c gene prediction tools, e.g. Prodigal, do not call them. Small proteins play vital roles in virulence, stress response, gene regulation, etc. 3/n
Bakta identifies known protein seqs via hash digests and attaches public stable identifiers from #RefSeq & @UniProt (WP_*, UPI*, UniRef100_*), fostering @FAIRsharing_org principles, enabling surveillance of gene alleles, and streamlining comparative analyses and downstream annotation steps. 4/n
Currently, Bakta identifies ~185 million distinct UniRef100 protein sequences. Hence, for certain genomes, up to 99% of all CDS can be identified via digest lookups alone, sparing computationally expensive sequence alignments. 5/n
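The alignment-free identification described in tweets 4/n and 5/n can be sketched roughly as follows. This is a minimal illustration, not Bakta's actual code: the MD5 digest, the table layout, the demo sequences and the function names (`digest`, `build_demo_db`, `identify`) are all assumptions made for this example.

```python
import hashlib
import sqlite3


def digest(aa_seq: str) -> str:
    """Hex MD5 digest of a normalized protein sequence (MD5 is an assumption)."""
    return hashlib.md5(aa_seq.strip().upper().encode("ascii")).hexdigest()


def build_demo_db(known: dict) -> sqlite3.Connection:
    """Build a tiny in-memory table of digest, length and product (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE proteins (digest TEXT PRIMARY KEY, length INTEGER, product TEXT)"
    )
    conn.executemany(
        "INSERT INTO proteins VALUES (?, ?, ?)",
        [(digest(seq), len(seq), product) for seq, product in known.items()],
    )
    return conn


def identify(conn: sqlite3.Connection, aa_seq: str):
    """Return the stored product for an exact sequence match, else None."""
    row = conn.execute(
        "SELECT length, product FROM proteins WHERE digest = ?", (digest(aa_seq),)
    ).fetchone()
    # Comparing the stored length as well is a cheap guard against digest collisions.
    if row is None or row[0] != len(aa_seq.strip()):
        return None
    return row[1]


# Made-up demo sequences and product names, not real database entries.
KNOWN = {
    "MKTAYIAKQR": "hypothetical demo protein A",
    "MSLNEQ": "hypothetical demo protein B",
}
conn = build_demo_db(KNOWN)
hit = identify(conn, "mktayiakqr")   # exact match after case normalization
miss = identify(conn, "MKTAYIAKQK")  # one residue differs, so no digest match
```

Only exact matches are found this way: a single substitution changes the digest, so sequences without a hit would still need a conventional alignment-based search.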
The @UniProt UniRef100 / UniRef90 protein clusters used by Bakta are further enriched with Dbxrefs (EC, GO, COG, RefSeq, UniParc) and pre-annotated via specialized DBs (AMR, IS). 6/n
The complete set of annotated feature types comprises tRNA, tmRNA, rRNA, ncRNA genes, ncRNA cis-regulatory regions, CRISPR arrays, CDS (including sORF), oriC/oriV, oriT and assembly gaps. 7/n
Bakta exports all information as machine-readable JSON, human-readable TSV and standard bioinformatics file formats: GFF3, GenBank, EMBL. The latter are INSDC-compliant & validated by ENA Webin-CLI for genome submissions 8/n
Bakta can handle complete replicons within partial draft assemblies and annotate CDS spanning sequence edges, improving annotations of partially complete genomes 9/n
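Why complete (circular) replicons matter for CDS at sequence edges can be illustrated with a small sketch. It is not how Bakta implements this: it scans only the forward strand, uses a naive ATG-to-stop ORF definition, and the function name `edge_spanning_orfs` and the demo sequence are made up for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def edge_spanning_orfs(seq: str, min_codons: int = 2):
    """Find forward-strand ORFs that cross the end of a circular sequence.

    Returns (start, end, dna) tuples with 0-based start and exclusive end,
    where end is given modulo len(seq). min_codons counts coding codons,
    including the start codon and excluding the stop codon.
    """
    seq = seq.upper()
    n = len(seq)
    doubled = seq + seq  # lets a reading frame run past the sequence end
    hits = []
    for start in range(n):
        if doubled[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, never further than one full turn of the circle.
        for pos in range(start + 3, start + n - 2, 3):
            if doubled[pos:pos + 3] in STOPS:
                end = pos + 3
                coding_codons = (end - start) // 3 - 1
                if end > n and coding_codons >= min_codons:
                    hits.append((start, end % n, doubled[start:end]))
                break
    return hits


# Made-up circular sequence: the ORF starts at position 11 and wraps to position 6.
result = edge_spanning_orfs("CCCTAAGGGGGATGAAA")
# result == [(11, 6, "ATGAAACCCTAA")]
```

Treated as a linear contig, the same sequence has no stop codon downstream of the ATG, so this ORF would be missed or truncated.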
Bakta is fast! It annotates a bacterial genome in 10 ±5 min on a laptop and plasmids in seconds to a few minutes, requiring only 4 GB of memory. 10/n
Bakta utilizes a taxon-independent, SemVer-versioned DB comprising AA & DNA sequences, HMM & covariance models and a compact read-only SQLite DB storing protein sequence digests, lengths, pre-assigned annotations & Dbxrefs, hosted at @ZENODO_ORG doi.org/10.5281/zenodo… 11/n
With this feature set and these runtime characteristics, Bakta aims at a well-balanced tradeoff between fully featured but computationally demanding pipelines (PGAP) and rapid, highly customizable pipelines (Prokka). 12/n
Bakta is new! Hence, bug reports and feedback of any kind are very much appreciated and feature requests are highly welcome! 13/n
