Thread by @jbloom_lab on Thread Reader App

In a new study, I identify and recover a deleted set of #SARSCoV2 sequences that provide additional information about viruses from the early Wuhan outbreak: biorxiv.org/content/10.110… (1/n)

Specifically, NIH maintains the Sequence Read Archive, where scientists around the world deposit deep sequencing data for others to analyze. I noted peerj.com/articles/9255 lists all #SARSCoV2 data in the archive as of March 31, 2020. Most are from a project by Wuhan University. (2/n)

But when I went to Sequence Read Archive, I found entire project was gone! (Note that as detailed below, this does *not* imply malfeasance by NIH. Sequence Read Archive policy allows submitters to delete by e-mail request.) (3/n)

I was able to determine deleted data corresponded to a study that partially sequenced “45 nasopharyngeal samples from [Wuhan] outpatients with suspected COVID-19 early in the epidemic” medrxiv.org/content/10.110… (4/n)

I discovered that even though the files were deleted from archive itself, they could be recovered from the Google Cloud at links like storage.googleapis.com/nih-sequence-r… (5/n)

Using this approach, I recovered files for the 34 early samples that were virus positive. I was able to use the data in the files to reconstruct partial viral sequences (from start of spike to end of ORF10) for 13 of these samples. (6/n)

Now I need to give background to explain a confusing scientific mystery about other early #SARSCoV2 sequences. Although events that led to emergence of #SARSCoV2 in Wuhan are unclear (zoonosis vs lab accident), everyone agrees deep ancestors are coronaviruses from bats. (7/n)

Therefore, we’d expect the first #SARSCoV2 sequences would be more similar to bat coronaviruses, and as #SARSCoV2 continued to evolve it would become more divergent from these ancestors. But that is *not* the case! (8/n)

Instead, early Huanan Seafood Market #SARSCoV2 viruses are more different from bat coronaviruses than #SARSCoV2 viruses collected later in China and even other countries. @lpipes @ras_nielsen give nice technical analysis at academic.oup.com/mbe/article/38… (9/n)

The conundrum is easily seen by plotting the relative differences from the bat coronavirus RaTG13 outgroup versus collection date for early #SARSCoV2. See how the first reported viruses from Wuhan (leftmost blue points) aren’t the closest to RaTG13. (10/n)

Same result if we use other bat coronaviruses like RpYN06 or RmYN02. To see this, go to jbloom.github.io/SARS-CoV-2_PRJ… for an interactive plot that allows you to select the bat coronavirus outgroup and mouse over points for strain details. (11/n)
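To make the "relative differences from the outgroup" idea concrete, here is a minimal sketch (not the code from the study). It assumes the reference and outgroup are already aligned to the same coordinates, and it uses made-up 10-nucleotide toy sequences. The function name `delta_to_outgroup` and the mutation format are illustrative assumptions.

```python
import re

# Substitutions written as <reference base><1-based position><new base>, e.g. "G3T".
MUTATION = re.compile(r"([ACGT])(\d+)([ACGT])")


def delta_to_outgroup(mutations, reference, outgroup):
    """Net change in the number of differences from the outgroup.

    Negative means the mutated sequence is closer to the outgroup than the
    reference is; positive means it is further away. Assumes `reference`
    and `outgroup` are aligned strings of equal length.
    """
    if len(reference) != len(outgroup):
        raise ValueError("reference and outgroup must be aligned to equal length")
    delta = 0
    for mutation in mutations:
        match = MUTATION.fullmatch(mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        ref_base, new_base = match.group(1), match.group(3)
        index = int(match.group(2)) - 1  # convert to 0-based index
        if reference[index] != ref_base:
            raise ValueError(f"{mutation} does not match the reference base")
        out_base = outgroup[index]
        if ref_base != out_base and new_base == out_base:
            delta -= 1  # mutation restores the outgroup base
        elif ref_base == out_base and new_base != out_base:
            delta += 1  # mutation creates a new difference
        # otherwise both bases differ from the outgroup: no change
    return delta


# Toy aligned sequences (invented for illustration, not real viral data).
toy_reference = "ACGTACGTAC"
toy_outgroup = "ACTTACGCAC"  # differs from the reference at positions 3 and 8

closer = delta_to_outgroup(["G3T", "T8C"], toy_reference, toy_outgroup)
further = delta_to_outgroup(["A5G"], toy_reference, toy_outgroup)
```

Plotting a value like this against collection date for each sequence gives the kind of picture described above: sequences with negative values sit closer to the bat coronavirus outgroup than the reference does.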

How do deleted sequences I recovered relate to this conundrum? If we include those sequences, and note 4 sequences from Guangdong are from two groups of people infected in Wuhan in late Dec / early Jan, we get plausible scenarios that resolve above problems. (12/n)

These two scenarios are plotted below. Each has a different “progenitor”, which is the sequence that gave rise to all *currently* known #SARSCoV2 sequences (still may not be virus that infected patient zero if other early sequences remain unknown). (13/n)

Both putative progenitors have 3 mutations relative to Seafood Market viruses that make them more similar to bat coronavirus. One is progenitor inferred by @kumar_lab @sergeilkp et al (academic.oup.com/mbe/advance-ar…), other has C8782T, T28144C, and C29095T relative to Wuhan-Hu-1. (14/n)
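For readers unfamiliar with the notation, C8782T means the reference has C at position 8782 (1-based) and the variant has T there. A minimal sketch of applying such substitutions follows. It is not the study's code: the function name `apply_mutations` is hypothetical, and the "reference" is a placeholder string of N characters with only the three named positions filled in, standing in for the real Wuhan-Hu-1 genome.

```python
import re


def apply_mutations(reference, mutations):
    """Return a copy of `reference` with substitutions like 'C8782T' applied.

    Positions are 1-based. Raises ValueError if a mutation cannot be parsed
    or its stated reference base does not match the sequence.
    """
    sequence = list(reference)
    for mutation in mutations:
        match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        old_base, new_base = match.group(1), match.group(3)
        index = int(match.group(2)) - 1  # convert to 0-based index
        if sequence[index] != old_base:
            raise ValueError(f"{mutation} does not match the reference base")
        sequence[index] = new_base
    return "".join(sequence)


# Placeholder reference (assumption for illustration): all N except the three
# positions named in the thread, set to the bases the notation implies.
placeholder = ["N"] * 29903
for position, base in [(8782, "C"), (28144, "T"), (29095, "C")]:
    placeholder[position - 1] = base
placeholder_reference = "".join(placeholder)

progenitor_like = apply_mutations(
    placeholder_reference, ["C8782T", "T28144C", "C29095T"]
)
```

With a real reference genome loaded from a FASTA file in place of the placeholder, the same function would produce the full candidate sequence. The base check guards against off-by-one coordinate errors.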

Both progenitors suggest #SARSCoV2 was circulating in Wuhan before December outbreak at Huanan Seafood Market, which is corroborated by lots of other evidence, including news articles from China in early 2020 (see intro to my paper linked in first Tweet in this thread). (15/n)

There are also broader implications. First, fact this dataset was deleted should make us skeptical that all other relevant early Wuhan sequences have been shared. We already know many labs in China ordered to destroy early samples: scmp.com/news/china/soc… (16/n)

Sequence sharing could be further limited by fact that scientists in China are under an order from the State Council requiring central approval of all publications: apnews.com/article/united… (17/n)

Second major implication is that it may be possible to obtain additional information about early spread of #SARSCoV2 in Wuhan even if efforts for more on-the-ground investigations are stymied. (18/n)

Scientific communication and data sharing typically rely on trust. The NIH Sequence Read Archive has >13,000,000 runs, so they have to trust authors when they request deletions, as it is not feasible to validate the reasons for all requests, some of which are legitimate. (19/n)

In case of data set I describe above, it seems possible that trust that the NIH Sequence Read Archive grants to scientific authors to delete data may have been used to obscure sequences informative for understanding early #SARSCoV2. (20/n)

Fortunately, Sequence Read Archive has rigorous data tracking enabling them to determine when data deleted & stated justification by authors. In fact, @NIHDirector @NCBI have already determined this & generously shared info w me, but will let them share more widely. (21/n)

It is important to examine if other trust-based systems in science conceivably may have also been used to hide data relevant to origins / early spread of #SARSCoV2. This includes not only looking more at sequence databases, but also paper reviews, grant reporting, etc. (22/n)

Third major implication is that scientists need to stay focused on data-driven study of #SARSCoV2 origins / early spread. After spending the last 4 months studying this closely, I am cautiously optimistic that additional relevant data are still likely to come to light. (23/n)

We should therefore avoid dogmatic arguments about #SARSCoV2 origins / early spread, and instead focus on following two questions: (1) How can we get more data? (2) How can we better analyze the data we have? (24/n)

Finally, my analysis is on GitHub at github.com/jbloom/SARS-Co… where you can access all code, data, & paper drafts. All updates are via time-stamped commits. This ensures transparency/reproducibility of this study are not in doubt, regardless of your views on interpretation. (25/n)
