⚙️Methods⚙️ paper📜 on RADseq

This one is for you if you think your RADseq data is 🗑️, & you feel 😭/😥 about it!


But first, shoutout to @MaurstadMarius who shares first-authorship w/ me, after teaching himself Unix as an undergraduate🧐- impressive!
This paper begins with this dataset on marine ghost-worms ( peerj.com/articles/10896/ ).

We had >90% missing data across the dataset, results that were impossible to make sense of (PCA/trees), and trouble just getting a reasonable number of high-quality variants...
In a conversation w/ @ncrochette, he suggested I should just run stacks on a population-by-population level and explore what was happening. I noticed:

1. ≠ populations had very ≠ nr. of variants;
2. some individuals at the population level had tons of missing data (bad apples)
Why would specimens have tons of missing data at the population level?

One expects individuals in a population to share the majority of enzyme cut-sites. Maybe something to do with library-preparation/sequencing/DNA quality? Since ...
... when dealing with various species, divergence is expected to correlate with the loss of restriction sites. An increase in divergence translates to more "allele dropout" (i.e. missing data); but this is not expected at the population-level.
What we noticed is that by removing these individuals with tons of missing data, and then reassembling the dataset without them ... we obtained a much-improved dataset!
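To make the "bad apples" idea concrete, here is a minimal sketch of flagging individuals by their share of missing genotypes within each population. It is not the code from the paper: the toy genotype table, the function names and the 0.5 cutoff are all assumptions for illustration (the thread does not state a threshold), and in practice the calls would come from your own per-population stacks output.

```python
from collections import defaultdict


def missing_fraction(genotypes):
    """Fraction of loci with no genotype call (None) for one individual."""
    return sum(g is None for g in genotypes) / len(genotypes)


def flag_bad_apples(calls, pop_of, threshold=0.5):
    """Group individuals whose missingness exceeds `threshold` by population.

    calls:  dict mapping sample name -> list of genotype calls (None = missing)
    pop_of: dict mapping sample name -> population label
    threshold: hypothetical cutoff, chosen here only for illustration
    """
    flagged = defaultdict(list)
    for sample, genotypes in calls.items():
        if missing_fraction(genotypes) > threshold:
            flagged[pop_of[sample]].append(sample)
    return dict(flagged)


# Toy data (made up): two populations, four loci each.
calls = {
    "popA_1": ["0/0", "0/1", "1/1", "0/0"],
    "popA_2": ["0/0", None, "1/1", "0/1"],
    "popA_3": [None, None, None, "0/0"],   # mostly missing
    "popB_1": ["0/1", "0/1", "0/0", "1/1"],
    "popB_2": ["0/0", "0/0", None, None],  # exactly at the cutoff, not above it
}
pop_of = {sample: sample.split("_")[0] for sample in calls}

bad = flag_bad_apples(calls, pop_of, threshold=0.5)
print(bad)
```

The flagged samples would then be left out before re-running the assembly; where to set the cutoff is a judgment call that depends on the dataset.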

After testing this in 3 other datasets, we generally confirmed these results.
In the paper, we:
1. explore the impact of removing random samples;
2. explore the impact of removing "bad apples";
3. explore the impact of removing "bad apples" in different parts of the stacks pipeline;
4. explore the data with PCA;
5. explore "properties" of the SNPs kept and removed (before & after removing samples)
... We conclude that this process will benefit non-model datasets by
1. increasing number of loci in the data; &
2. decreasing missing data.
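A toy illustration of why both effects can show up at once, assuming a fixed genotype table and a per-locus call-rate filter. Everything here is hypothetical (the data, the function names, the 0.8 call-rate cutoff), and it does not model the reassembly step described in the thread; it only shows how one sample with mostly missing calls can drag loci below a completeness filter, while dropping a random "good" sample does not help.

```python
def loci_passing(calls, min_call_rate):
    """Indices of loci called in at least `min_call_rate` of the samples."""
    samples = list(calls)
    n_loci = len(calls[samples[0]])
    kept = []
    for i in range(n_loci):
        called = sum(calls[s][i] is not None for s in samples)
        if called / len(samples) >= min_call_rate:
            kept.append(i)
    return kept


def overall_missing(calls):
    """Share of all genotype cells that are missing (None)."""
    total = sum(len(g) for g in calls.values())
    missing = sum(x is None for g in calls.values() for x in g)
    return missing / total


def drop(calls, samples):
    """Copy of `calls` without the given samples."""
    return {s: g for s, g in calls.items() if s not in samples}


# Toy data (made up): four individuals, four loci; ind4 is the "bad apple".
calls = {
    "ind1": ["0/0", "0/1", "1/1", "0/0"],
    "ind2": ["0/0", "0/0", None, "0/1"],
    "ind3": ["0/1", "0/0", "0/0", None],
    "ind4": [None, None, None, "0/0"],
}

for label, subset in [
    ("all samples", calls),
    ("without bad apple (ind4)", drop(calls, {"ind4"})),
    ("without a random good sample (ind1)", drop(calls, {"ind1"})),
]:
    n_kept = len(loci_passing(subset, min_call_rate=0.8))
    print(f"{label}: {n_kept} loci kept, {overall_missing(subset):.2f} missing")
```

In this toy table, only the run without ind4 keeps any loci, and it also has the lowest share of missing cells.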

One particular advantage is that we do not find evidence of removing a particular "class of loci". Why this matters:
A. If you are very strict in your filters, you may keep only very conserved loci;
B. If you are very liberal, you may pass artefacts down to your final dataset.

Both A and B have been shown in RADseq papers, which we cite and discuss.
We therefore think we came up with a simple method to "clean-up" datasets that may suffer mostly from library-building, sequencing and DNA-quality issues - which would explain "allele dropout" at the population/species level.
Big shoutout to @ncrochette @jcatchen @arcolon14 @RayamajhiN for their god-tier expertise and very constructive comments on RADseq; @fez_nhm for actually suggesting we should turn this into a paper, and helping analyze the data.
& @MaurstadMarius for being motivated with bioinformatics and doing tons of stacks runs.

That's all folks :)
