In new study, I have analyzed correlation between SARS-CoV-2 & animal genetic material in full set of environmental samples from Huanan Seafood Market.
Analysis clarifies what sequencing these samples can & cannot tell us about early outbreak at market. biorxiv.org/content/10.110…
Background:
China first reported coronavirus cases associated w market & no human transmission. But then we learned there was human transmission & some early cases were not from market.
In 2022, Chinese CDC released pre-print describing sampling market beginning Jan-1-2020: researchsquare.com/article/rs-137…
They collected 457 animal & 923 environmental samples. They stated all animal samples tested negative, but 73 environmental samples positive.
In their 2022 pre-print, Chinese CDC described deep sequencing >150 environmental samples.
They included plot (below) that showed SARS-CoV-2 content of samples was correlated w human genetic material. From this, they concluded humans were source of virus in samples at market.
But Chinese CDC did not label other points on plot, so it wasn’t clear what other species had genetic material that correlated with SARS-CoV-2 abundance. They also didn’t provide raw sequencing data to enable other scientists to do this analysis.
This omission was widely noted in multiple news articles where scientists requested access to raw data to analyze which species correlated w SARS-CoV-2: science.org/content/articl…
Eventually, Chinese CDC uploaded some of the raw sequencing files to GISAID, where they were downloaded by another group of scientists who started analyzing data.
Before any written analysis was posted, media started reporting that data suggested raccoon dogs may have been infected at market because their genetic material was co-mingled w SARS-CoV-2 in environmental samples (nytimes.com/2023/03/16/sci… & theatlantic.com/science/archiv…)
The next week, the scientists (@critschristoph et al) reported their initial analysis of environmental samples: zenodo.org/record/7754299…
Crits-Christoph et al reported some samples contained genetic material from raccoon dogs & other susceptible animal species (like bamboo rats).
Analysis by Crits-Christoph et al therefore genetically confirmed prior reports that species like raccoon dogs & bamboo rats were present at market.
The genetic details could inform tracing supply of these animals, which is important to investigate.
However, Crits-Christoph et al did not report analysis of SARS-CoV-2 content of samples: they just used Chinese CDC classification of whether samples were “positive”.
So their analysis did not identify which species have genetic material correlated w viral material.
Next week the Chinese CDC uploaded revised version of their pre-print, which shortly thereafter was published in Nature (nature.com/articles/s4158…). They also made all raw sequencing data available on public databases like SRA and NGDC.
New Chinese CDC paper agreed some samples had material from raccoon dogs & other species, although they did metagenomics differently than Crits-Christoph et al (probably not as well).
But they also did not analyze correlation of SARS-CoV-2 & animal genetic material in samples.
In fact, Chinese CDC even removed their earlier incompletely labeled SARS-CoV-2 vs species correlation plot that started all the questions.
So there still hasn’t been any analysis of how SARS-CoV-2 genetic material correlates w that of other animals!
My new analysis addresses what animal genetic material correlates w SARS-CoV-2.
To do this, I wrote fully reproducible computational pipeline that downloads all raw sequencing data (which exceeds 3 terabytes) from NGDC database: github.com/jbloom/Huanan_…
First, I confirmed data Chinese CDC uploaded to NGDC is superset of data analyzed by Crits-Christoph.
See SHA-512 file hashes: github.com/jbloom/Huanan_…
So regardless of earlier controversy about data access, unmodified versions of all files now publicly available.
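For anyone who wants to repeat this kind of check, here is a minimal sketch of comparing files by SHA-512 hash. The function names and the idea of holding each data set as a collection of hex digests are assumptions for illustration, not the code in the pipeline.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks
    so that very large sequencing files do not need to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_hashes(subset_hashes, superset_hashes):
    """Return hashes present in subset_hashes but absent from superset_hashes.

    An empty result means every file in the first collection has a
    byte-identical counterpart in the second (hypothetical helper).
    """
    return set(subset_hashes) - set(superset_hashes)
```

If `missing_hashes` returns an empty set, the second collection is a superset of the first at the level of file contents, whatever the files are named.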
I then aligned sequencing reads to concatenated reference of SARS-CoV-2 & chordate mitochondrial genomes.
This enabled me to quantify both how much SARS-CoV-2 & mitochondrial genetic material from each species is in each sample.
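A rough sketch of that quantification step, assuming you already have per-reference aligned read counts for a sample. The dict layout, function name, and example counts are made up for illustration; the actual pipeline may organize this differently.

```python
def summarize_sample(read_counts, total_reads, virus_key="SARS-CoV-2"):
    """Summarize one sample from aligned read counts.

    read_counts maps reference name -> aligned reads, where one key is the
    virus and all other keys are species mitochondrial genomes
    (hypothetical layout). Returns (percent of all reads that are viral,
    percent mitochondrial composition by species).
    """
    viral_reads = read_counts.get(virus_key, 0)
    mito = {name: n for name, n in read_counts.items() if name != virus_key}
    mito_total = sum(mito.values())
    composition = {
        species: (100 * n / mito_total if mito_total else 0.0)
        for species, n in mito.items()
    }
    percent_viral = 100 * viral_reads / total_reads if total_reads else 0.0
    return percent_viral, composition


# Made-up counts, for illustration only
example_counts = {"SARS-CoV-2": 1, "raccoon dog": 300, "duck": 700}
percent_viral, composition = summarize_sample(example_counts, total_reads=200_000_000)
```

Note the two quantities use different denominators: viral content is relative to all reads in the sample, while species composition is relative to mitochondrial reads only.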
I correlated species compositions from my analysis with compositions reported by Crits-Christoph et al (which are only for mammals).
Results highly correlated (see figure below). This is good: two independent analyses get similar results for species compositions of samples.
But note composition depends on reference set. I use chordates; Crits-Christoph et al report compositions for mammals.
Below are mitochondrial compositions for sample Q61: raccoon dog most abundant mammal, duck most abundant chordate.
Different if you align contigs to full genomes: then more raccoon dog than duck (Fig 1B of my pre-print)
There isn’t one correct way. Here I use mitochondrial composition to be consistent w Crits-Christoph et al report & because some species don’t have full genomes available.
Now for new part: what is SARS-CoV-2 content of samples?
Below is plot of percent of reads mapping to SARS-CoV-2 for each sample.
Most samples have little or no SARS-CoV-2. Samples with most SARS-CoV-2 have mitochondrial material mostly from fish.
Many ways to partition samples: date collected, etc. Interactive plot at jbloom.github.io/Huanan_market_… allows you to do that.
Eg, if we only look at later sampling dates, sample w most SARS-CoV-2 dominated by rat snake, dove, & human mitochondrial material.
What about samples w lots of mitochondrial material from susceptible non-human animals like raccoon dogs?
Below is table of SARS-CoV-2 content of all samples with >20% of their chordate mitochondrial material from a susceptible non-human species.
There are 14 samples w >20% chordate mitochondrial material from raccoon dog: 13 have no SARS2 reads, other has 1 in ~200,000,000 reads mapping to SARS2
0 of 6 samples w >20% bamboo rat material have SARS2 reads
1 sample each w Malayan porcupine & Amur hedgehog have SARS2 reads
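The filter behind a table like this can be sketched as below. The record layout, the function name, and the example values are assumptions for illustration; the species set just lists the species named in this thread.

```python
# Species named in this thread as susceptible; illustrative, not exhaustive
SUSCEPTIBLE = {"raccoon dog", "bamboo rat", "Malayan porcupine", "Amur hedgehog"}


def samples_above_threshold(samples, species_set=SUSCEPTIBLE, threshold=20.0):
    """Return (sample name, species, percent, SARS2 reads) for every sample
    in which a species of interest exceeds `threshold` percent of the
    chordate mitochondrial material.

    Each sample is a dict with keys "name", "mito_percent" (species -> percent)
    and "sars2_reads" (hypothetical layout).
    """
    rows = []
    for sample in samples:
        for species, pct in sample["mito_percent"].items():
            if species in species_set and pct > threshold:
                rows.append((sample["name"], species, pct, sample["sars2_reads"]))
    return rows


# Made-up samples, for illustration only
example = [
    {"name": "sample_A", "mito_percent": {"raccoon dog": 45.0, "duck": 55.0}, "sars2_reads": 0},
    {"name": "sample_B", "mito_percent": {"carp": 90.0, "human": 10.0}, "sars2_reads": 12},
]
rows = samples_above_threshold(example)
```

Changing `threshold` or `species_set` gives a different table, which is one reason the quantitative read counts are worth reporting alongside any cutoff.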
We can correlate number of SARS-CoV-2 reads w mitochondrial reads for each species across all samples (below).
Highest correlation for largemouth bass, catfish, cow, carp, snakehead fish
Humans modestly correlated w SARS2 reads
Raccoon dogs negatively correlated w SARS2 reads
There are many ways to subset data on sampling dates, calculate correlation, etc.
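One such way is sketched here: a Pearson correlation of log-transformed read counts across samples. The log10 transform and the pseudocount of 1 are assumptions for this sketch and may not match the exact choices in the pre-print.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined if either input is constant
    return sxy / math.sqrt(sxx * syy)


def log_correlation(viral_reads, mito_reads, pseudocount=1):
    """Correlate log10(viral reads) with log10(mitochondrial reads for one
    species) across samples; the pseudocount keeps zero counts finite."""
    log_viral = [math.log10(v + pseudocount) for v in viral_reads]
    log_mito = [math.log10(m + pseudocount) for m in mito_reads]
    return pearson(log_viral, log_mito)
```

Running `log_correlation` once per species, with the same list of samples each time, gives one coefficient per species that can then be ranked.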
Finally, we can circle back to question scientists asked when Chinese CDC first posted pre-print in 2022: what if you label other species in correlation plot?
Below is that plot, shown only for samples w at least one SARS2 read for consistency w original Chinese CDC figure.
Species most correlated w SARS2 are fish & livestock, followed by humans. Raccoon dogs & bamboo rats negatively correlated w SARS2.
Similar if we only look at samples collected on Jan-12-2020, which was date of most intense wildlife stall sampling.
Again, lots of ways to subset samples & calculate correlations, & you can explore them using the interactive plots at jbloom.github.io/Huanan_market_…
So how did we end up w media articles about raccoon dog material co-mingled w SARS2?
Raccoon dogs are one of species least co-mingled w SARS2, and Q61 raccoon dog sample only has 1 of 200,000,000 reads mapping to SARS2.
May have to do w how Chinese CDC called sample positivity
In their pre-print/paper, Chinese CDC called “positive” any sample that either tested positive by RT-qPCR or had >0 sequencing reads mapping to SARS-CoV-2.
But these are environmental samples: they mix genetic material from various animals and/or viruses (plus there is probably index hopping in sequencing)
Criteria Chinese CDC used to call positivity aren’t consistent.
Eg, Q61 was negative by RT-qPCR, but has 1 of 200,000,000 reads mapping to SARS2. Not consistent to call Q61 positive but call negative other samples that also tested negative by RT-qPCR & were never sequenced.
Therefore, I suggest future work should stop using “positive” / “negative” classification of Chinese CDC table, & instead analyze quantitative SARS2 content only across samples subjected to same consistent set of assays (eg, Ct values or SARS2 read content).
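To make the inconsistency concrete, here is a small sketch. The either/or rule follows the description above; the record layout, function names, and the two example samples are hypothetical.

```python
def either_assay_positive(sample):
    """Positive if RT-qPCR was positive OR at least one sequencing read
    mapped to the virus (the either/or rule described above)."""
    reads = sample.get("sars2_reads")  # None means the sample was never sequenced
    return bool(sample.get("qpcr_positive")) or (reads is not None and reads > 0)


def sequenced_only(samples):
    """Keep only samples that were sequenced, so every sample being compared
    went through the same assay."""
    return [s for s in samples if s.get("sars2_reads") is not None]


def reads_per_million(sample):
    """Quantitative viral content: viral reads per million total reads."""
    return 1e6 * sample["sars2_reads"] / sample["total_reads"]


# Two made-up samples that both tested negative by RT-qPCR
sequenced = {"name": "A", "qpcr_positive": False, "sars2_reads": 1, "total_reads": 200_000_000}
not_sequenced = {"name": "B", "qpcr_positive": False, "sars2_reads": None, "total_reads": None}

# The rule labels A positive and B negative, although B never had the
# chance to be called positive by sequencing.
labels = [either_assay_positive(sequenced), either_assay_positive(not_sequenced)]
```

Restricting to `sequenced_only(...)` and comparing `reads_per_million` avoids mixing samples that were held to different standards.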
More broadly, what can we conclude about COVID-19 origins from all this?
Probably not much.
@DrTedros of @WHO had correct interpretation: we should analyze everything, but these data don’t tell us how pandemic began
However, these tables get so big they are difficult to look at.
That is point of interactive scatter plots here: jbloom.github.io/Huanan_market_…
Choose any species, sampling date, etc & then see SARS2 vs mitochondrial content in one small plot & mouse over points for details.
I have posted updated version of preprint on bioRxiv:
This update includes addition of tables mentioned in three Tweets above, plus some revisions and responses to thoughtful comments by @flodebarre @acritschristoph detailed here: biorxiv.org/content/10.110…
The final peer-reviewed version of my analysis of the environmental samples at the Huanan market is now published in Virus Evolution: academic.oup.com/ve/article/9/2…
I have performed additional new analyses comparing SARSCoV2 to other animal CoVs in environmental samples from Huanan Market
Paper w these new analyses is & full computational pipeline at
I first calculated total reads mapping to SARSCoV2 & other animal CoVs across all samples, & just samples collected on date of wildlife stall sampling (Jan-12-2020)
Six CoVs have >500 reads; of these 4 have many reads from Jan-12-2020 samples, but 2 have few reads from that date
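The tally itself is simple; a minimal sketch is below. The record layout, function names, and the idea of passing the date as a string are assumptions for illustration only.

```python
from collections import defaultdict


def total_reads_by_virus(records, date=None):
    """Sum reads per virus across samples, optionally restricted to samples
    collected on one date.

    Each record is a dict with keys "virus", "sample", "date" and "reads"
    (hypothetical layout).
    """
    totals = defaultdict(int)
    for record in records:
        if date is None or record["date"] == date:
            totals[record["virus"]] += record["reads"]
    return dict(totals)


def abundant_viruses(records, min_reads=500):
    """Viruses with more than min_reads reads summed over all samples."""
    totals = total_reads_by_virus(records)
    return {virus: n for virus, n in totals.items() if n > min_reads}
```

Comparing `total_reads_by_virus(records)` with `total_reads_by_virus(records, date="2020-01-12")` shows which viruses get most of their reads from that one sampling date and which do not.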
Specifically, bamboo rat CoV, two canine CoVs, & rabbit CoV have substantial reads in samples from Jan-12-2020 wildlife-stall sampling
I next analyzed reads on per-sample basis. Below I just show results for Jan-12-2020 as that is when most samples w material from potentially susceptible animals (eg, raccoon dogs, bamboo rats) collected.
For bamboo rat, canine & rabbit CoVs there were samples w 100s viral reads. Samples w most viral reads had largest frac animal material from known hosts
But all Jan-12-2020 samples had few SARSCoV2 or rat CoV reads, & samples w most reads had little material from plausible hosts
To better explore the data in above plot, see jbloom.github.io/Huanan_market_… for an interactive plot that enables you to mouseover points for details on samples, select individual CoVs to display, and show additional dates.
I also plotted viral vs animal genetic content for Jan-12-2020 samples.
For 4 most abundant animal CoVs in these samples there is association of viral & host animal content, but not for much less abundant SARSCoV2 and rat CoV
Overall, these results show that genetic material from some animal CoVs is fairly abundant in samples collected during the wildlife-stall sampling of the Huanan Market on Jan-12-2020. However, SARSCoV2 is not one of these CoVs.
For the animal CoVs with high abundance on Jan-12-2020, there are meaningful associations between the content of viral and animal genetic material. But there are not such associations for SARSCoV2 and less abundant viruses like the rat CoV Lucheng-19.
There remain significant caveats related to the underlying available data, as discussed in limitations section at end of my initial paper on this topic (academic.oup.com/ve/article/9/2…)
@Nucleocapsoid @HNimanFC @mrmickme2 @0bFuSc8 @PeacockFlu @CVRHutchinson @SCOTTeHENSLEY To add to thread linked above, human British Columbia H5 case has a HA sequence (GISAID EPI_ISL_19548836) that is ambiguous at *both* site Q226 and site E190 (H3 numbering)
Both these sites play an important role in sialic acid binding specificity
@Nucleocapsoid @HNimanFC @mrmickme2 @0bFuSc8 @PeacockFlu @CVRHutchinson @SCOTTeHENSLEY If you are searching literature, these sites are E190 and Q226 in H3 numbering, E186 and Q222 in mature H5 numbering, and E202 and Q238 in sequential H5 numbering (see: dms-vep.org/Flu_H5_America…)
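If it helps when cross-referencing papers, the correspondences above can be kept as a small lookup table. This sketch covers only the two sites named here, and the names `SITES` and `convert` are made up for illustration. It is a lookup rather than an arithmetic offset because H3 numbering comes from an alignment and the offset is not constant along the protein.

```python
# Correspondences stated above, keyed by the H3-numbered site name
SITES = {
    "E190": {"h3": 190, "mature_h5": 186, "sequential_h5": 202},
    "Q226": {"h3": 226, "mature_h5": 222, "sequential_h5": 238},
}


def convert(site_number, source, target):
    """Convert a site number between numbering schemes, for the sites
    listed in SITES only (raises KeyError for anything else)."""
    for numbering in SITES.values():
        if numbering[source] == site_number:
            return numbering[target]
    raise KeyError(f"site {site_number} ({source} numbering) is not in the table")


# For these two sites, sequential and mature H5 numbering differ by the same
# amount (16); this is only checked here, not assumed for other sites.
offsets = {n["sequential_h5"] - n["mature_h5"] for n in SITES.values()}
```

For example, `convert(238, "sequential_h5", "h3")` gives the H3 number to search for when a paper reports the site in sequential H5 numbering.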
Here is analysis of HA mutations in H5 influenza case in Missouri resident without known contact w animals or raw milk.
TLDR: there is one HA mutation that strongly affects antigenicity, and another that merits some further study.
As background, CDC recently released partial sequence of A/Missouri/121/2024, which is virus from person in Missouri who was infected with H5 influenza: cdc.gov/bird-flu/spotl…
Here I am analyzing HA protein from this release, GISAID accession EPI_ISL_19413343
Sequence covers all of HA except signal peptide, and residues 325-351 (sequential numbering) / 312-335 (H3 numbering). The missing residues encompass HA1-HA2 boundary, and any missed mutations there unlikely to affect antigenicity or receptor binding, but could affect stability.
In new study led by @bblarsen1 in collab w @veeslerlab @VUMC_Vaccines we map functional & antigenic landscape of Nipah virus receptor binding protein (RBP)
Results elucidate constraints on RBP function & provide insight re protein’s evolutionary potential. biorxiv.org/content/10.110…
Nipah is bat virus that sporadically infects humans w high (~70%) fatality rate. There has been limited human transmission
Like other paramyxoviruses, Nipah uses two proteins to enter cells: RBP binds receptor & then triggers fusion (F) protein by process that is not fully understood
RBP forms tetramer in which 4 constituent monomers (which are all identical in sequence) adopt 3 distinct conformations
RBP binds to two receptors, EFNB2 & EFNB3
RBP’s affinity for EFNB2 is very high (~0.1 nM, over an order of magnitude higher than SARSCoV2’s affinity for ACE2)
Over 4 yrs after being first to publicly release SARS-CoV-2 genome, Yong-Zhen Zhang just published large set of viral seqs from first stage of COVID-19 outbreak in China
Zhang recruited nearly all COVID-19 patients hospitalized at Shanghai Public Health Center in first 2/3 (Jan-Sep) of 2020.
The largest source of Shanghai patients in Jan/Feb 2020 was imported cases from Wuhan or elsewhere in Hubei, thereby providing window into Wuhan outbreak.
Overall, Zhang obtained 343 near-full-length SARS-CoV-2 sequences from 226 distinct patients, including 133 sequences from samples collected no later than Feb-15-2020.
A phylogenetic tree showing these sequences is below.