In a new study, I identify and recover a deleted set of #SARSCoV2 sequences that provide additional information about viruses from the early Wuhan outbreak: biorxiv.org/content/10.110… (1/n)
Specifically, NIH maintains the Sequence Read Archive, where scientists around world deposit deep sequencing data for others to analyze. I noted peerj.com/articles/9255 lists all #SARSCoV2 data in archive as of March-31-2020. Most from a project by Wuhan University. (2/n)
But when I went to Sequence Read Archive, I found entire project was gone! (Note that as detailed below, this does *not* imply malfeasance by NIH. Sequence Read Archive policy allows submitters to delete by e-mail request.) (3/n)
I was able to determine deleted data corresponded to a study that partially sequenced “45 nasopharyngeal samples from [Wuhan] outpatients with suspected COVID-19 early in the epidemic” medrxiv.org/content/10.110… (4/n)
I discovered that even though the files were deleted from archive itself, they could be recovered from the Google Cloud at links like storage.googleapis.com/nih-sequence-r… (5/n)
Using this approach, I recovered files for the 34 early samples that were virus positive. I was able to use the data in the files to reconstruct partial viral sequences (from start of spike to end of ORF10) for 13 of these samples. (6/n)
Now I need to give background to explain a confusing scientific mystery about other early #SARSCoV2 sequences. Although events that led to emergence of #SARSCoV2 in Wuhan are unclear (zoonosis vs lab accident), everyone agrees deep ancestors are coronaviruses from bats. (7/n)
Therefore, we’d expect the first #SARSCoV2 sequences would be more similar to bat coronaviruses, and as #SARSCoV2 continued to evolve it would become more divergent from these ancestors. But that is *not* the case! (8/n)
The conundrum is easily seen by plotting the relative differences from the bat coronavirus RaTG13 outgroup versus collection date for early #SARSCoV2. See how the first reported viruses from Wuhan (leftmost blue points) aren’t the closest to RaTG13. (10/n)
Same result if we use other bat coronaviruses like RpYN06 or RmYN02. To see this, go to jbloom.github.io/SARS-CoV-2_PRJ… for an interactive plot that allows you to select the bat coronavirus outgroup and mouse over points for strain details. (11/n)
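The comparison in the plot can be sketched roughly as follows. This is only an illustration, not the analysis code from the study: the function names, the choice to subtract a reference sequence's count as the baseline, and the toy sequences are all assumptions, and no real viral data is used.

```python
def diffs_from_outgroup(seq, outgroup):
    """Count aligned positions where seq differs from the outgroup.

    Positions where either sequence has something other than A/C/G/T
    (for example N or a gap) are skipped rather than counted.
    """
    if len(seq) != len(outgroup):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq.upper(), outgroup.upper())
        if a in valid and b in valid and a != b
    )


def relative_diffs(seqs, outgroup, reference):
    """Differences from the outgroup, relative to a chosen reference.

    Negative values mean a sequence is closer to the outgroup than
    the reference is; positive values mean it is more divergent.
    """
    baseline = diffs_from_outgroup(reference, outgroup)
    return {
        name: diffs_from_outgroup(seq, outgroup) - baseline
        for name, seq in seqs.items()
    }


# Toy, made-up aligned sequences (hypothetical; not real genomes).
toy_outgroup = "ACGTACGTAC"
toy_reference = "ACGAACGTAA"
toy_seqs = {
    "toy_early": "ACGAACGTAA",
    "toy_closer": "ACGTACGTAA",
    "toy_later": "ACGAACCTAA",
    "toy_with_n": "ACGANCGTAA",
}

result = relative_diffs(toy_seqs, toy_outgroup, toy_reference)
for name, value in result.items():
    print(name, value)
```

Plotting these values against collection date would give a figure of the same general shape as the one described, with the outgroup swapped by passing a different `outgroup` string.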
How do deleted sequences I recovered relate to this conundrum? If we include those sequences, and note 4 sequences from Guangdong are from two groups of people infected in Wuhan in late Dec / early Jan, we get plausible scenarios that resolve above problems. (12/n)
These two scenarios are plotted below. Each has a different “progenitor”, which is the sequence that gave rise to all *currently* known #SARSCoV2 sequences (still may not be virus that infected patient zero if other early sequences remain unknown). (13/n)
Both putative progenitors have 3 mutations relative to Seafood Market viruses that make them more similar to bat coronavirus. One is progenitor inferred by @kumar_lab @sergeilkp et al (academic.oup.com/mbe/advance-ar…), other has C8782T, T28144C, and C29095T relative to Wuhan-Hu-1. (14/n)
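For readers unfamiliar with the notation, a mutation like C8782T means the reference has C at 1-based position 8782 and the variant has T there. A minimal sketch of parsing and applying such strings, assuming a short made-up reference rather than the real Wuhan-Hu-1 genome (the helper names are hypothetical):

```python
import re

# Reference base, 1-based position, new base (e.g. "C8782T").
MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_mutation(mutation):
    """Split a string like 'C8782T' into ('C', 8782, 'T')."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def apply_mutations(reference, mutations):
    """Return a copy of reference with each mutation applied.

    Raises ValueError if a position is out of range or the reference
    base does not match what the mutation string claims.
    """
    bases = list(reference)
    for mutation in mutations:
        ref, pos, alt = parse_mutation(mutation)
        if pos < 1 or pos > len(bases):
            raise ValueError(f"{mutation}: position outside reference")
        if bases[pos - 1] != ref:
            raise ValueError(f"{mutation}: reference has {bases[pos - 1]}")
        bases[pos - 1] = alt
    return "".join(bases)


# The three mutations named in the thread, parsed only (not applied,
# since no real genome is loaded here).
parsed = [parse_mutation(m) for m in ["C8782T", "T28144C", "C29095T"]]

# Toy example on a made-up 6-base reference.
toy_variant = apply_mutations("ACGTAC", ["C2T", "T4C"])
print(parsed)
print(toy_variant)
```

Checking the stated reference base before substituting catches off-by-one errors in position, which is the most common mistake with this notation.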
Both progenitors suggest #SARSCoV2 was circulating in Wuhan before December outbreak at Huanan Seafood Market, which is corroborated by lots of other evidence, including news articles from China in early 2020 (see intro to my paper linked in first Tweet in this thread). (15/n)
There are also broader implications. First, fact this dataset was deleted should make us skeptical that all other relevant early Wuhan sequences have been shared. We already know many labs in China ordered to destroy early samples: scmp.com/news/china/soc… (16/n)
Sequence sharing could be further limited by fact that scientists in China are under an order from the State Council requiring central approval of all publications: apnews.com/article/united… (17/n)
Second major implication is that it may be possible to obtain additional information about early spread of #SARSCoV2 in Wuhan even if efforts for more on-the-ground investigations are stymied. (18/n)
Scientific communication and data sharing typically rely on trust. The NIH Sequence Read Archive has >13,000,000 runs, so they have to trust authors when they request deletions as not feasible to validate reasons for all requests, some of which are legitimate. (19/n)
In case of data set I describe above, it seems possible that trust that the NIH Sequence Read Archive grants to scientific authors to delete data may have been used to obscure sequences informative for understanding early #SARSCoV2. (20/n)
Fortunately, Sequence Read Archive has rigorous data tracking enabling them to determine when data deleted & stated justification by authors. In fact, @NIHDirector @NCBI have already determined this & generously shared info w me, but will let them share more widely. (21/n)
It is important to examine if other trust-based systems in science conceivably may have also been used to hide data relevant to origins / early spread of #SARSCoV2. This includes not only looking more at sequence databases, but also paper reviews, grant reporting, etc. (22/n)
Third major implication is that scientists need to stay focused on data-driven study of #SARSCoV2 origins / early spread. After spending the last 4 months studying this closely, I am cautiously optimistic that additional relevant data are still likely to come to light. (23/n)
We should therefore avoid dogmatic arguments about #SARSCoV2 origins / early spread, and instead focus on following two questions: (1) How can we get more data? (2) How can we better analyze the data we have? (24/n)
Finally, my analysis is on GitHub at github.com/jbloom/SARS-Co… where you can access all code, data, & paper drafts. All updates are via time-stamped commits. This ensures transparency/reproducibility of this study are not in doubt, regardless of your views on interpretation. (25/n)
I’ve updated SARSCoV2 antibody-escape calculator w new deep mutational scanning data of @yunlong_cao @jianfcpku
My interpretation: antigenic evolution currently constrained by pleiotropic effects of mutations on RBD-ACE2 affinity, RBD up-down position & antibody neutralization
@Nucleocapsoid @HNimanFC @mrmickme2 @0bFuSc8 @PeacockFlu @CVRHutchinson @SCOTTeHENSLEY To add to thread linked above, human British Columbia H5 case has a HA sequence (GISAID EPI_ISL_19548836) that is ambiguous at *both* site Q226 and site E190 (H3 numbering)
Both these sites play an important role in sialic acid binding specificity
@Nucleocapsoid @HNimanFC @mrmickme2 @0bFuSc8 @PeacockFlu @CVRHutchinson @SCOTTeHENSLEY If you are searching literature, these sites are E190 and Q226 in H3 numbering, E186 and Q222 in mature H5 numbering, and E202 and Q238 in sequential H5 numbering (see: dms-vep.org/Flu_H5_America…)
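The three numbering schemes for these two sites can be kept straight with a small lookup table. This sketch covers only the two sites listed above; the scheme labels and function name are made up for illustration, and it deliberately does not extrapolate to other HA positions, since offsets between schemes are not constant along the protein.

```python
# Only the two sites named in the thread; nothing else is inferred.
SITES = [
    {"wt": "E", "H3": 190, "mature_H5": 186, "sequential_H5": 202},
    {"wt": "Q", "H3": 226, "mature_H5": 222, "sequential_H5": 238},
]


def convert_site(number, from_scheme, to_scheme):
    """Translate a site number between numbering schemes.

    Returns the wildtype residue plus the number in the target
    scheme, e.g. 'Q238'. Raises KeyError for sites not in the table.
    """
    for row in SITES:
        if row[from_scheme] == number:
            return f"{row['wt']}{row[to_scheme]}"
    raise KeyError(f"site {number} ({from_scheme}) is not in the table")


print(convert_site(226, "H3", "sequential_H5"))
print(convert_site(186, "mature_H5", "H3"))
```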
Here is analysis of HA mutations in H5 influenza case in Missouri resident without known contact w animals or raw milk.
TLDR: there is one HA mutation that strongly affects antigenicity, and another that merits some further study.
As background, CDC recently released partial sequence of A/Missouri/121/2024, which is virus from person in Missouri who was infected with H5 influenza.
Here I am analyzing HA protein from this release, GISAID accession EPI_ISL_19413343 cdc.gov/bird-flu/spotl…
Sequence covers all of HA except signal peptide, and residues 325-351 (sequential numbering) / 312-335 (H3 numbering). The missing residues encompass HA1-HA2 boundary, and any missed mutations there unlikely to affect antigenicity or receptor binding, but could affect stability.
In new study led by @bblarsen1 in collab w @veeslerlab @VUMC_Vaccines we map functional & antigenic landscape of Nipah virus receptor binding protein (RBP)
Results elucidate constraints on RBP function & provide insight re protein’s evolutionary potential biorxiv.org/content/10.110…
Nipah is bat virus that sporadically infects humans w high (~70%) fatality rate. Has been limited human transmission
Like other paramyxoviruses, Nipah uses two proteins to enter cells: RBP binds receptor & then triggers fusion (F) protein by process that is not fully understood
RBP forms tetramer in which 4 constituent monomers (which are all identical in sequence) adopt 3 distinct conformations
RBP binds to two receptors, EFNB2 & EFNB3
RBP’s affinity for EFNB2 is very high (~0.1 nM, over an order of magnitude higher than SARSCoV2’s affinity for ACE2)