We searched 5.7M seq libraries (10.2 petabases) for all 15,000 known RNA viruses. In 11 days, we uncovered 130,000+ new RNA viruses (incl. 9 new CoV, with a twist). That’s nearly an order-of-magnitude bump.
[1/N] 🧵👇
[2] For the Scientific Conclusions, @rayanchikhi has a great thread from the preprint:
👉
[3] As the pandemic hit, like many scientists we wanted to help. The idea was simple: analyze all public sequencing data to ensure every possible Coronavirus sequence ever sampled is identified and freely available. And do it fast.
(aka Eye of SRAn)
[4] By luck, @NIHDataScience STRIDES had just finished mirroring the massive Sequence Read Archive (SRA) to cloud platforms. An opportunity!
[5] The world’s DNA/RNA sequencing was at our fingertips as an Open Dataset on @awscloud. Accessing 20 million gigabytes of sequencing data was no longer a bottleneck; we eventually searched it all in under 11 days.
[7] The coolest part of open-source projects is teaming up with awesome devs who improve their tools too. We got a tailored version of SPAdes, coronaSPAdes (protip: you can use it for any RNA virus), and a significant boost in small-query alignment for DIAMOND v2. Stay tuned for MUSCLE v5!
[9] Serratus is a volunteer project. We started out at the #hacksqRNA hackathon (ty: @RNASociety / @UBC MedGen) and continue to have an open-door collaboration policy (*cough* you should join)
[10] We took part in COVID19 #bioHackathon, @EUvsVirus, @hackzurich, and @redhat Team19, sent out tweets, and emailed bioinformaticians and virologists. Eventually we got an amazing and passionate crew together. <3 <3
[11] Huge thanks to the long list of people who took the time to discuss, share insights, or just pop in for a few days to help. And to the team at @UBC#CIC and AWS who helped make this possible.
And of course, what matters most is the friends we made along the way…
[13] All Serratus data is immediately free and public (CC0). Our goal is to catalyze research into Earth’s virome as intuitively as possible. Reach out if you need any help :)