Pat Schloss

@PatSchloss


Please stop using the open (and closed) reference clustering methods based on USEARCH/VSEARCH. Instead, use de novo clustering algorithms to form OTUs 1/

Open/closed reference methods were developed because people have the type of crappy data that you get when reads don't fully overlap. High error rates increase the number of uniques making downstream processing more difficult 2/ blog.mothur.org/2014/09/11/Why…

So why are open (and closed) reference clustering bad? Let's look at closed reference first. I may self-plagiarize in some of the following tweets 3/ peerj.com/articles/1487/

Obviously, you're limited to seeing what is in the database. Considering we're sequencing deeper than any of the studies that generated the reference sequences, we're going to fail to classify many sequences 4/

Using a mouse dataset we found 32,106 unique sequences. Using a 3% threshold we were able to map 27,876 w/ VSEARCH and 27,737 w/ USEARCH - both methods are heuristics, which help to speed things up, but cut corners 5/

When we calculated the actual distances between the mouse sequences and the database sequences, we were able to map 28,238 sequences to a reference 6/

We then randomized the order of the database. We got different numbers of sequences to map to a reference with different randomizations. The default order is probably about as bad as you could do 7/

The XSEARCH heuristics cut corners and don't have perfect sensitivity. But they also result in a loss of specificity - USEARCH was 0.73 and VSEARCH was 0.60 8/

But that's just the start. The reference that many use is based on the greengenes database (fwiw - greengenes is defunct and the database has not been updated) 9/

The reference represents sequences that are not more than 97% similar to each other... over the full length of the gene. We popped out the V4 region of those sequences 10/

Among these V4 fragments, we found that 3,132 ref sequences had one duplicate; of those, 443 had discordant taxonomies. Among the 1,699 V4 reference sequences with two or more duplicates, 698 had discordant taxonomies 11/

We found V4 reference sequences with 10, 30, and even 131 duplicates that carried 7, 7, and 5 different taxonomies, respectively 12/
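A minimal sketch of the kind of check described in the two tweets above: group reference fragments that are identical over the region of interest and flag the groups whose taxonomies disagree. The function name, the input layout, and the toy sequences and taxa are all made up for illustration; this is not the code used for the published analysis.

```python
from collections import defaultdict

def discordant_duplicates(references):
    """Return fragments shared by 2+ references whose taxonomies disagree.

    `references` is a hypothetical list of (fragment, taxonomy) pairs,
    e.g. the V4 region cut out of each full-length reference sequence.
    """
    taxa_by_fragment = defaultdict(list)
    for fragment, taxonomy in references:
        taxa_by_fragment[fragment].append(taxonomy)
    return {
        fragment: sorted(set(taxa))
        for fragment, taxa in taxa_by_fragment.items()
        if len(taxa) > 1 and len(set(taxa)) > 1
    }

# Toy data, invented for this example only
toy_refs = [
    ("ACGTACGT", "Bacteroides"),
    ("ACGTACGT", "Parabacteroides"),  # same fragment, different taxonomy
    ("TTGACCGA", "Lactobacillus"),
    ("TTGACCGA", "Lactobacillus"),    # duplicate, but taxonomies agree
    ("GGCATTAC", "Clostridium"),      # no duplicate
]
conflicts = discordant_duplicates(toy_refs)
print(conflicts)
```

Only the first fragment is reported, because it is the only one whose duplicates carry more than one taxonomy.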

We calculated the true distance between the mice & ref seqs: 47% of the mouse seqs mapped to refs that were identical over the V4 region, 17% mapped equally well to 2+ refs that were not identical over the V4 region 13/

13% had a conflicted taxonomy *gulp* 14/

Yes, closed-reference OTUs are "stable", but only if you look at it from the samples' perspective. They aren't from the references' perspective. This has consequences for people using Fast UniFrac, PICRUSt, etc 15/

Some of these problems could be solved w/ a ref where the V4 (or whatever region) sequences are not more than 97% similar to each other, but you'd also have to deal with conflicting taxonomies and the poor specificity of XSEARCH 16/

Let's discuss one more thing before moving on to open reference clustering. With closed reference, if a seq is more than 97% similar (or whatever level) to a reference, the seq is assigned to that ref OTU 17/

We've already discussed several problems with this. Another is that two seqs might only be 94% similar to each other, but be in the same OTU - this isn't what we typically think is going on when we talk about 97% OTUs. Right? 18/
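A worked toy example of the 94% point, assuming equal-length, already-aligned sequences and an "at least 97%" cutoff. The sequences are artificial: each query differs from the reference at three positions, but at different ends, so the two queries differ from each other at six.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

reference = "A" * 100
seq_a = "CCC" + "A" * 97   # 3 mismatches at the start
seq_b = "A" * 97 + "GGG"   # 3 mismatches at the end

id_a_ref = identity(seq_a, reference)
id_b_ref = identity(seq_b, reference)
id_a_b = identity(seq_a, seq_b)

# Both queries pass a 97% cutoff against the reference, so closed-reference
# clustering puts them in the same OTU even though they are 94% identical.
print(id_a_ref, id_b_ref, id_a_b)
```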

Now open reference clustering... In this approach, sequences are first subjected to closed ref clustering and those that don't map to a reference are run through a de novo clustering approach 19/
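A rough sketch of the two-stage idea, under loud assumptions: sequences are equal length and pre-aligned, identity is a simple match fraction, and the de novo stage is a greedy, seed-based stand-in chosen only for brevity. It is not how USEARCH, VSEARCH, or any mothur method is implemented, and all names here are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def open_reference(seqs, refs, threshold=0.97):
    """Toy open-reference clustering: closed-reference first, de novo for the rest."""
    otus = {}   # OTU label -> member sequences
    seeds = {}  # de novo OTU label -> seed sequence
    for seq in seqs:
        # Stage 1 (closed reference): take the first reference at/above the cutoff
        label = next((name for name, ref in refs.items()
                      if identity(seq, ref) >= threshold), None)
        if label is None:
            # Stage 2 (de novo stand-in): join an existing seed or start a new OTU
            label = next((name for name, seed in seeds.items()
                          if identity(seq, seed) >= threshold), None)
            if label is None:
                label = f"denovo{len(seeds) + 1}"
                seeds[label] = seq
        otus.setdefault(label, []).append(seq)
    return otus

# Artificial inputs, invented for this example
refs = {"ref1": "A" * 100}
seqs = ["A" * 99 + "C", "G" * 100, "G" * 98 + "TT", "C" * 100]
otus = open_reference(seqs, refs)
print({label: len(members) for label, members in otus.items()})
```

Because stage 1 takes the first qualifying reference, the toy result depends on the order of `refs`. That is a property of this sketch, not a claim about any real tool's internals.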

In de novo clustering, sequences that are more than 97% similar to each other are clustered. There are many ways to do this and I certainly have data/papers to support one over the field of methods 20/

When you use open ref clustering, you get all the problems of closed reference clustering, but you are also mixing and matching OTU definitions 21/

You get the possibility of sequences that are only 94% similar to each other in the same OTU (closed ref) and those that are at least 97% similar to each other (de novo) 22/

Depending on what type of samples you're sequencing you'll get different ratios of OTUs generated by closed ref and de novo clustering. Human/mouse are mostly closed, soil probably has more de novo 23/

We developed an objective, distance-based metric for assessing OTU clustering quality. It asks: are the correct sequences included in the OTU, and are there any sequences that should have been included in the OTU but weren't? It's based on the Matthews correlation coefficient (MCC) 24/
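A minimal sketch of how a pairwise MCC for an OTU assignment can be computed, assuming the usual confusion-matrix framing: a pair of sequences within the distance threshold and in the same OTU is a true positive, a pair beyond the threshold and in different OTUs is a true negative, and the two mismatched cases are false positives and false negatives. The function name, data layout, and toy distances are illustrative assumptions, not mothur's implementation.

```python
from itertools import combinations
from math import sqrt

def otu_mcc(distances, otu_of, threshold=0.03):
    """Matthews correlation coefficient of an OTU assignment.

    distances: dict mapping frozenset({a, b}) -> pairwise distance (hypothetical layout)
    otu_of:    dict mapping sequence name -> OTU label
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        close = distances[frozenset((a, b))] <= threshold
        together = otu_of[a] == otu_of[b]
        if close and together:
            tp += 1
        elif close:
            fn += 1   # should share an OTU but do not
        elif together:
            fp += 1   # share an OTU but are too distant
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy distances, invented for this example
pairs = {("s1", "s2"): 0.01, ("s1", "s3"): 0.02, ("s2", "s3"): 0.05,
         ("s1", "s4"): 0.10, ("s2", "s4"): 0.10, ("s3", "s4"): 0.10}
distances = {frozenset(k): v for k, v in pairs.items()}

good = otu_mcc(distances, {"s1": "A", "s2": "A", "s3": "A", "s4": "B"})
bad = otu_mcc(distances, {"s1": "A", "s2": "B", "s3": "C", "s4": "A"})
print(round(good, 3), round(bad, 3))
```

The first assignment scores higher because it keeps the close pairs together and only pays for one distant pair (s2, s3) sharing an OTU; the second splits every close pair and merges a distant one.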

Here's what we saw for a few datasets. This shows that the closed/open reference algorithms don't measure up to de novo msystems.asm.org/content/1/2/e0… 25/

Later we developed our own de novo heuristic that optimizes MCC as it forms the OTUs - OptiClust. This forms the best OTUs of any clustering method we've found ncbi.nlm.nih.gov/pubmed/28289728 26/

One knock against de novo methods is that you get different results for each iteration of the method. Yup. What we've found is that you get equally good clusterings - there is no one optimal method 27/

There's also the knock that we shouldn't be using OTUs, but that's a topic for another tweet storm 28/

Appendix: Here's actual data showing what happens when reads don't overlap. As shown in the above blog post, we (and many others) have tried extending the contig lengths with the 2x300 V3 chemistry and it just sucks 29/ aem.asm.org/lookup/doi/10.…

Appendix: Regardless of that plea and a lot of experience, people have pushed using reads that don't fully overlap (or that don't overlap at all). Examples include EMP and AGP data where single reads or 2x150 nt reads were used to sequence the 250 nt V4 region 30/

Appendix: Other examples include people getting greedy and sequencing V3-V4 with 2x250 nt reads or 2x300 nt reads 31/

Footnote: I've been saying the same thing for years now. My critique of open/closed ref clustering was published 4+ years ago and I'm not sure anyone gives a rip. Why not? What am I missing? That we had to write this ncbi.nlm.nih.gov/pubmed/27832214 underscores my point 32/

Fin: Get good data and use the best methods 33/
