We've once again updated our paper benchmarking long-read assemblers for bacterial genomes! Take a look at the fresh results here: f1000research.com/articles/8-2138
Updates since the last version include...
(1/9)
New versions of some assemblers: Canu v2.0, Flye v2.8, Raven v1.1.10 and Shasta v0.5.1. My favourite change here is that Flye no longer requires a genome size parameter.
(2/9)
I've also added a new assembler to the comparison: NextPolish/NextDenovo. It performed well on chromosomes but not on plasmids, and it was more cumbersome to run than the other tools.
(3/9)
I've moved some supp figures into the main text. Most interesting to me is panel E which shows the maximum indel error size in assemblies.
(4/9)
This shows that all assemblers can sometimes make very large errors: hundreds or even thousands of bases in size! Flye performed best in this regard, often keeping its errors under 10 bp, but it wasn't totally immune to the problem.
(5/9)
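For anyone who wants to check this metric on their own assemblies, here is a minimal sketch. It assumes you have already aligned the assembly to a trusted reference and have the alignment as a CIGAR string. The function name and the example CIGAR are made up for illustration, and this is not necessarily how the paper measured it.

```python
import re

def max_indel_size(cigar):
    """Return the longest single insertion or deletion run in a CIGAR string."""
    # Each CIGAR element is a length followed by a one-letter operation code.
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Only I (insertion) and D (deletion) count as indels; 0 if there are none.
    return max((int(n) for n, op in ops if op in "ID"), default=0)

# Hypothetical alignment: a 2-bp deletion, a 1-bp insertion and a 150-bp insertion.
print(max_indel_size("1000=2D500=1I3000=150I20="))  # 150
print(max_indel_size("5000="))                      # 0
```

A single large value here can hide behind an otherwise excellent identity score, which is why the maximum is worth reporting separately.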
This issue (big indel errors in assemblies) was one of the main reasons I created Trycycler. It makes a consensus assembly from multiple input assemblies and can therefore avoid large-scale errors such as these. github.com/rrwick/Trycycl…
(6/9)
My main recommendations have not really changed. Favourites are still Flye (for overall quality), Miniasm/Minipolish (for clean circularisation) and Raven (for speed and reliability).
(7/9)
If you twisted my arm for a single recommendation, I'd have to pick Flye. It does well in most metrics and I really like that it makes fewer big indel errors. Nice work, @fenderglass!
(8/9)
That's all for now! Thanks again to @F1000Research for facilitating these updates so the benchmark can stay up-to-date. Canu v2.1 and NECAT v20200803 are out, so I'll get started on the next version 😀
(9/9)
Peer review brought quite a few improvements, so many thanks to the reviewers! My favourite addition is this new supp figure.
(2/6)
It shows that Polypolish was the tool least likely to introduce errors during polishing. It only did so at one place in 100 genomes (panel D) where it changed a 3-bp deletion to a 5-bp deletion in a tandem repeat.
(3/6)
I just released a new version of Unicycler (v0.5.0) which fixes SPAdes compatibility, drops some extraneous bits and patches a few bugs. github.com/rrwick/Unicycl…
Unicycler is now nearly 6 years old, so here's a thread with my thoughts on its place in the world in 2022.
(1/8)
Unicycler is a hybrid (short+long) bacterial genome assembly pipeline that takes a short-read-first approach. I.e. it first makes a short-read assembly graph, then uses the long reads to scaffold the graph to completion.
(2/8)
Short-read-first assembly made a lot of sense when Unicycler was first built in 2016. Back then, Nanopore reads were often shallow and low-quality, so the short-read graph made a good starting point for assembly.
(3/8)
Our preprint describing Polypolish is now up: biorxiv.org/content/10.110…
Polypolish is a short-read polisher for long-read bacterial genome assemblies. Some highlights from the paper follow in this thread...
(1/12)
There are already quite a few short-read polishers out there: HyPo, NextPolish, ntEdit, Pilon, POLCA, Racon and wtpoa. So why did we add to this collection? It's because they nearly all suffer from the same problem with errors in repeats.
(2/12)
When you align short reads to a long-read genome assembly in the 'normal' one-alignment-per-read manner, you often get no coverage over errors in repeats. This is because reads will preferentially align to other error-free instances of the repeat.
(3/12)
I've just released (during #MicroSeq2021) a new short-read polishing tool for fixing errors in long-read bacterial genome assemblies: Polypolish! github.com/rrwick/Polypol…
(1/8)
There are many other short-read polishing tools, including HyPo, NextPolish, ntEdit, Pilon and POLCA. So what does Polypolish do differently to warrant another?
(2/8)
Most other polishers use 'normal' short-read alignments, where each read is aligned to one best location (randomly chosen in a tie). This works fine in non-repeat sequences, but errors in repeats often lead to a lack of alignments and therefore can't be fixed.
(3/8)
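A toy sketch of the problem, under loud assumptions: ungapped mismatch counting stands in for a real aligner, and every name and sequence below is hypothetical. It is not code from any of the polishers mentioned. It only shows why best-location alignment can leave an error in a repeat with no coverage.

```python
# Toy illustration only: ungapped mismatch counting stands in for a real aligner.
# All names and sequences below are hypothetical.

def mismatches(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def placements(read, ref):
    """Yield (start, mismatch_count) for every ungapped placement of read on ref."""
    k = len(read)
    for start in range(len(ref) - k + 1):
        yield start, mismatches(read, ref[start:start + k])

def best_hits(read, ref):
    """'Normal' alignment: keep only the placement(s) with the fewest mismatches."""
    scored = list(placements(read, ref))
    best = min(n for _, n in scored)
    return best, [start for start, n in scored if n == best]

def all_hits(read, ref, max_mismatches):
    """Alternative: keep every placement within a mismatch budget."""
    return [start for start, n in placements(read, ref) if n <= max_mismatches]

def covered(starts, read_len, pos):
    """True if any placement starting at one of `starts` spans position `pos`."""
    return any(s <= pos < s + read_len for s in starts)

repeat = "ACGTTGCAAGCT"
bad_repeat = "ACGTTGGAAGCT"   # same repeat with one assembly error (C -> G)
assembly = "TTTTT" + repeat + "CCCCC" + bad_repeat + "GGGGG"
error_pos = assembly.index(bad_repeat) + 6   # position of the erroneous base

read = repeat[2:10]           # an error-free read from inside the repeat

_, best = best_hits(read, assembly)
print(covered(best, len(read), error_pos))                          # False
print(covered(all_hits(read, assembly, 1), len(read), error_pos))   # True
```

The read matches the error-free copy perfectly, so a best-only strategy never places it on the erroneous copy, and the bad base gets no coverage. Keeping every near-match placement puts a read over the error.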
Excited to announce a new preprint! We did a study comparing two different @nanopore library prep approaches (ligation and rapid) for bacterial genomes with small plasmids: biorxiv.org/content/10.110…
(1/11)
I really like this paper because it has a clear conclusion simple enough to fit in a tweet: rapid preps are better than ligation preps at recovering small plasmids.
(2/11)
Figure 1 gives a simplified illustration of why we think this is the case: due to their size, small circular plasmids can avoid fragmentation during DNA extraction, leaving no ends for adapter ligation. Rapid preps, in contrast, don't depend on DNA ends.
(3/11)
Trycycler is for generating a consensus long-read assembly of a bacterial genome.
(1/9)
I.e. you give Trycycler multiple different long-read assemblies of the same genome, and it produces a single consensus assembly that is better than any of the inputs.
(2/9)
In doing so, Trycycler can repair most of the problems that hide in long-read assemblies. These include: 1) missing/spurious contigs 2) bad circularisation 3) glitchy sequence regions