Bloom Lab, Apr 27, 2023
In new study, I have analyzed correlation between SARS-CoV-2 & animal genetic material in full set of environmental samples from Huanan Seafood Market.


Analysis clarifies what sequencing these samples can & cannot tell us about early outbreak at market. biorxiv.org/content/10.110…
Background:

China first reported coronavirus cases associated w market & no human transmission. But then we learned there was human transmission & some early cases were not from market.

Thus began still unanswered questions of role of market, nicely summarized here science.org/content/articl…
In 2022, Chinese CDC released pre-print describing sampling market beginning Jan-1-2020:

They collected 457 animal & 923 environmental samples. They stated all animal samples tested negative, but 73 environmental samples positive. researchsquare.com/article/rs-137…
In their 2022 pre-print, Chinese CDC described deep sequencing >150 environmental samples.

They included plot (below) that showed SARS-CoV-2 content of samples was correlated w human genetic material. From this, they concluded humans were source of virus in samples at market. Image
But Chinese CDC did not label other points on plot, so it wasn’t clear what other species had genetic material that correlated with SARS-CoV-2 abundance. They also didn’t provide raw sequencing data to enable other scientists to do this analysis.
This omission was widely noted in multiple news articles where scientists requested access to raw data to analyze which species correlated w SARS-CoV-2 (science.org/content/articl…).
Eventually, Chinese CDC uploaded some of the raw sequencing files to GISAID, where they were downloaded by another group of scientists who started analyzing data.
Before any written analysis posted, media started reporting that data suggested raccoon dogs may have been infected at market because their genetic material was co-mingled w SARS-CoV-2 in environmental samples (nytimes.com/2023/03/16/sci… & theatlantic.com/science/archiv…)
The next week, the scientists reported their initial analysis, @critschristoph et al, of environmental samples:

Crits-Christoph et al reported some samples contained genetic material from raccoon dogs & other susceptible animal species (like bamboo rats). zenodo.org/record/7754299…
Analysis by Crits-Christoph et al therefore genetically confirmed prior reports that species like raccoon dogs & bamboo rats were present at market.

The genetic details could inform tracing supply of these animals, which is important to investigate.
However, Crits-Christoph et al did not report analysis of SARS-CoV-2 content of samples: they just used Chinese CDC classification of whether samples were “positive”.

So their analysis did not identify which species have genetic material correlated w viral material.
Next week the Chinese CDC uploaded revised version of their pre-print, which shortly thereafter was published in Nature (nature.com/articles/s4158…). They also made all raw sequencing data available on public databases like SRA and NGDC.
New Chinese CDC paper agreed some samples had material from raccoon dogs & other species, although they did metagenomics differently than Crits-Christoph et al (probably not as well).

But they also did not analyze correlation of SARS-CoV-2 & animal genetic material in samples.
In fact, Chinese CDC even removed their earlier incompletely labeled SARS-CoV-2 vs species correlation plot that started all the questions.

So there still hasn’t been any analysis of how SARS-CoV-2 genetic material correlates w that of other animals!
My new analysis addresses what animal genetic material correlates w SARS-CoV-2.

To do this, I wrote fully reproducible computational pipeline that downloads all raw sequencing data (which exceeds 3 terabytes) from NGDC database: github.com/jbloom/Huanan_…
First, I confirmed data Chinese CDC uploaded to NGDC is superset of data analyzed by Crits-Christoph.

See SHA-512 file hashes: github.com/jbloom/Huanan_…

So regardless of earlier controversy about data access, unmodified versions of all files now publicly available.
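A minimal sketch of this kind of check: hash each file with SHA-512 and confirm every hash from one collection appears in the other. The helper names and the in-memory demo files are hypothetical stand-ins, not taken from the actual pipeline.

```python
import hashlib
import io


def sha512_of_stream(stream, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a binary stream, read in chunks."""
    digest = hashlib.sha512()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def missing_hashes(subset_hashes, superset_hashes):
    """Return hashes in subset_hashes that are absent from superset_hashes."""
    return set(subset_hashes) - set(superset_hashes)


# Hypothetical stand-ins for sequencing files; real files would be
# opened with open(path, "rb") and passed to sha512_of_stream.
ngdc_files = {"a.fastq.gz": b"reads-A", "b.fastq.gz": b"reads-B", "c.fastq.gz": b"reads-C"}
earlier_files = {"a_copy.fastq.gz": b"reads-A", "b_copy.fastq.gz": b"reads-B"}

ngdc_hashes = {sha512_of_stream(io.BytesIO(data)) for data in ngdc_files.values()}
earlier_hashes = {sha512_of_stream(io.BytesIO(data)) for data in earlier_files.values()}

# The larger set is a superset if no hash from the smaller set is missing.
is_superset = not missing_hashes(earlier_hashes, ngdc_hashes)
print(is_superset)
```

Comparing hashes rather than file names means renamed but unmodified files still match.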
I then aligned sequencing reads to concatenated reference of SARS-CoV-2 & chordate mitochondrial genomes.

This enabled me to quantify both how much SARS-CoV-2 & mitochondrial genetic material from each species is in each sample.
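To illustrate the quantification step, here is a rough sketch that tallies aligned reads per reference once each read has been assigned to a sequence in the concatenated reference. The reference names, the `tally_alignments` helper, and the toy read assignments are all hypothetical; the real pipeline works from alignment files.

```python
from collections import Counter

SARS2_REF = "SARS-CoV-2"  # hypothetical name of the viral sequence in the concatenated reference


def tally_alignments(read_to_reference, reference_to_species):
    """Count aligned reads per species, keeping SARS-CoV-2 separate from mitochondria."""
    sars2_reads = 0
    mito_counts = Counter()
    for reference in read_to_reference.values():
        if reference == SARS2_REF:
            sars2_reads += 1
        elif reference in reference_to_species:
            mito_counts[reference_to_species[reference]] += 1
        # Reads mapped to nothing (None) or to unknown references are ignored.
    return sars2_reads, mito_counts


# Made-up reference names and read assignments, for illustration only.
reference_to_species = {"mito_raccoon_dog": "raccoon dog", "mito_duck": "duck"}
read_to_reference = {
    "read1": "mito_duck",
    "read2": "mito_duck",
    "read3": "mito_raccoon_dog",
    "read4": "SARS-CoV-2",
    "read5": None,  # unaligned read
}

sars2_reads, mito_counts = tally_alignments(read_to_reference, reference_to_species)
print(sars2_reads, dict(mito_counts))
```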
I correlated species compositions from my analysis with compositions reported by Crits-Christoph et al (which are only for mammals).

Results highly correlated (see figure below). This is good: two independent analyses get similar results for species compositions of samples. Image
But note composition depends on reference set. I use chordates; Crits-Christoph et al report compositions for mammals.

Below are mitochondrial compositions for sample Q61: raccoon dog most abundant mammal, duck most abundant chordate. Image
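A small sketch of why the reference set matters: the same read counts give different percentages, and a different top species, depending on which species are in the denominator. The counts below are made up for illustration and are NOT the real values for Q61 or any other sample.

```python
def composition(counts, species_subset=None):
    """Return percent of mitochondrial reads per species.

    If species_subset is given, only those species enter the denominator.
    """
    if species_subset is not None:
        counts = {sp: n for sp, n in counts.items() if sp in species_subset}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sp: 100 * n / total for sp, n in counts.items()}


# Made-up read counts for one hypothetical sample.
counts = {"duck": 500, "raccoon dog": 300, "human": 150, "chicken": 50}
mammals = {"raccoon dog", "human"}

chordate_pct = composition(counts)          # all chordates in denominator
mammal_pct = composition(counts, mammals)   # mammals only in denominator

top_chordate = max(chordate_pct, key=chordate_pct.get)
top_mammal = max(mammal_pct, key=mammal_pct.get)
print(top_chordate, top_mammal)
```

With these toy numbers raccoon dog is 30% of chordate reads but about 67% of mammal reads, so a fixed percentage cutoff means different things under different reference sets.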
Different if you align contigs to full genomes: then more raccoon dog than duck (Fig 1B of my pre-print)

There isn’t one correct way. Here I use mitochondrial composition to be consistent w Crits-Christoph et al report & because some species don’t have full genomes available.
You can go to jbloom.github.io/Huanan_market_… to interactively look up mitochondrial species composition for any sample.
Now for new part: what is SARS-CoV-2 content of samples?

Below is plot of percent of reads mapping to SARS-CoV-2 for each sample.

Most samples have little or no SARS-CoV-2. Samples with most SARS-CoV-2 have mitochondrial material mostly from fish. Image
Many ways to partition samples: date collected, etc. Interactive plot at jbloom.github.io/Huanan_market_… allows you to do that.

Eg, if we only look at later sampling dates, sample w most SARS-CoV-2 dominated by rat snake, dove, & human mitochondrial material.
What about samples w lots of mitochondrial material from susceptible non-human animals like raccoon dogs?

Below is table of SARS-CoV-2 content of all samples with >20% of their chordate mitochondrial material from a susceptible non-human species. Image
There are 14 samples w >20% chordate mitochondrial material from raccoon dog: 13 have no SARS2 reads, other has 1 in ~200,000,000 reads mapping to SARS2

0 of 6 samples w >20% bamboo rat material have SARS2 reads

1 sample each w Malayan porcupine & Amur hedgehog have SARS2 reads
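As a rough sketch of how such a table can be built: keep samples where a susceptible species exceeds the cutoff, then report SARS2 reads as a percent of all reads. The sample names and numbers are invented, and the species set simply reuses the species named in this thread; only the 1-in-200,000,000 ratio echoes a figure quoted above.

```python
# Species named in this thread; treated here as an illustrative set only.
SUSCEPTIBLE = {"raccoon dog", "bamboo rat", "Malayan porcupine", "Amur hedgehog"}


def percent_sars2(sars2_reads, total_reads):
    """Percent of all reads that map to SARS-CoV-2."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100 * sars2_reads / total_reads


def samples_above_cutoff(samples, species_set=SUSCEPTIBLE, cutoff_pct=20.0):
    """List (sample, species, percent SARS2) where species exceeds cutoff_pct."""
    rows = []
    for name, info in samples.items():
        for species, pct in info["mito_pct"].items():
            if species in species_set and pct > cutoff_pct:
                rows.append(
                    (name, species, percent_sars2(info["sars2_reads"], info["total_reads"]))
                )
    return rows


# Invented samples for illustration.
samples = {
    "sample_A": {"mito_pct": {"raccoon dog": 35.0, "duck": 40.0},
                 "sars2_reads": 1, "total_reads": 200_000_000},
    "sample_B": {"mito_pct": {"bamboo rat": 60.0, "human": 10.0},
                 "sars2_reads": 0, "total_reads": 50_000_000},
    "sample_C": {"mito_pct": {"carp": 90.0, "raccoon dog": 2.0},
                 "sars2_reads": 400, "total_reads": 10_000_000},
}

table = samples_above_cutoff(samples)
for row in table:
    print(row)
```

One read in 200,000,000 is 0.0000005% of reads, which is why a quantitative view differs so much from a yes/no "positive" call.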
We can correlate number of SARS-CoV-2 reads w mitochondrial reads for each species across all samples (below).

Highest correlation for largemouth bass, catfish, cow, carp, snakehead fish

Humans modestly correlated w SARS2 reads

Raccoon dogs negatively correlated w SARS2 reads Image
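For readers who want to try this kind of calculation, here is a hedged sketch of a per-species correlation across samples. It assumes a Pearson correlation on log10(count + 1); that transform and pseudocount are my illustrative choices and may differ from what the paper uses. All counts are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def log_correlation(sars2_counts, species_counts, pseudocount=1):
    """Correlate log10-transformed read counts (pseudocount avoids log of zero)."""
    log_sars2 = [math.log10(c + pseudocount) for c in sars2_counts]
    log_species = [math.log10(c + pseudocount) for c in species_counts]
    return pearson(log_sars2, log_species)


# Invented per-sample read counts (one entry per sample).
sars2 = [0, 10, 100, 1000]
fish = [5, 50, 500, 5000]
raccoon_dog = [1000, 100, 10, 0]

r_fish = log_correlation(sars2, fish)
r_raccoon_dog = log_correlation(sars2, raccoon_dog)
print(round(r_fish, 3), round(r_raccoon_dog, 3))
```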
There are many ways to subset data on sampling dates, calculate correlation, etc.

Interactive plots at jbloom.github.io/Huanan_market_… & jbloom.github.io/Huanan_market_… let you see how correlations change when you do that.
Finally, we can circle back to question scientists asked when Chinese CDC first posted pre-print in 2022: what if you label other species in correlation plot?

Below is that plot, shown only for samples w at least one SARS2 read for consistency w original Chinese CDC figure. Image
Species most correlated w SARS2 are fish & livestock, followed by humans. Raccoon dogs & bamboo rats negatively correlated w SARS2.

Similar if we only look at samples collected on Jan-12-2020, which was date of most intense wildlife stall sampling.
Again, lots of ways to subset samples & calculate correlations, & you can explore them using the interactive plots at jbloom.github.io/Huanan_market_…
So how did we end up w media articles about raccoon dog material co-mingled w SARS2?

Raccoon dogs are one of species least co-mingled w SARS2, and Q61 raccoon dog sample only has 1 of 200,000,000 reads mapping to SARS2.

May have to do w how Chinese CDC called sample positivity Image
In their pre-print/paper, Chinese CDC called “positive” any sample that either tested positive by RT-qPCR or had >0 sequencing reads mapping to SARS-CoV-2.

But these are environmental samples: they mix various animal and/or viral sequences (plus there is probably index hopping in sequencing)
Criteria Chinese CDC used to call positivity aren’t consistent.

Eg, Q61 was negative by RT-qPCR, but has 1 of 200,000,000 reads mapping to SARS2. Not consistent to call Q61 positive but call negative other samples that also tested negative by RT-qPCR & were never sequenced. Image
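The inconsistency can be made concrete with a tiny sketch of the rule as described above (positive if RT-qPCR positive OR more than 0 SARS2 reads). The function name and the encoding of "never sequenced" as `None` are my assumptions for illustration.

```python
def cdc_style_call(rt_qpcr_positive, sars2_reads=None):
    """Apply the either/or positivity rule described in the thread.

    sars2_reads=None means the sample was never sequenced, so the
    sequencing criterion cannot make it positive.
    """
    return bool(rt_qpcr_positive) or (sars2_reads is not None and sars2_reads > 0)


# RT-qPCR negative, sequenced, 1 SARS2 read (the Q61-like situation).
q61_like = cdc_style_call(rt_qpcr_positive=False, sars2_reads=1)

# RT-qPCR negative, never sequenced: it had no chance to pass the read criterion.
unsequenced = cdc_style_call(rt_qpcr_positive=False, sars2_reads=None)

print(q61_like, unsequenced)
```

Two samples with the same RT-qPCR result get different labels only because one was subjected to an extra assay, which is the argument for comparing samples on one consistent measurement.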
Therefore, I suggest future work should stop using “positive” / “negative” classification of Chinese CDC table, & instead analyze quantitative SARS2 content only across samples subjected to same consistent set of assays (eg, Ct values or SARS2 read content).
More broadly, what can we conclude about COVID-19 origins from all this?

Probably not much.

@DrTedros of @WHO had correct interpretation: we should analyze everything, but these data don’t tell us how pandemic began
Recall market samples were collected on Jan-1-2020 or later.

First human SARS2 infections in Wuhan occurred no later than Nov 2019.

By Jan 2020, SARS2 had been spread widely across market by humans, regardless of how it originated.
Viral material is most co-mingled w material from fish & livestock products, but virus clearly did NOT originate w those species & products.

It’s simply that environmental samples taken over month after humans started spreading virus do not reliably indicate outbreak origin.
If we ever learn origin of SARS2, I suspect it will come from information on events that occurred in Nov 2019 (or earlier).

Until then, we should analyze all available data--but be circumspect & cognizant of limits of these data & our knowledge.
Finally, interactive versions of all plots from my analysis are at jbloom.github.io/Huanan_market_…

Computer code is at github.com/jbloom/Huanan_…

Pre-print is at biorxiv.org/content/10.110…

I hope others explore & build on this computer code w further analyses.
In response to discussion, I don't think analysis disproves or proves source

Just emphasizes these samples collected too late to reveal origin

We need info on earlier events

Unless we get that, we need to acknowledge don't know exactly what happened
I was getting questions re 20% cutoff used to decide which samples to show in Table 1 of pre-print.

Analysis is of all samples, 20% cutoff is just to make Table 1 manageable in size.

For any sample w >1% chordate mitochondrial material from raccoon dogs, see bigger table below Image
If you want SARS2 content of all samples ordered by raccoon dog mitochondrial %, see this bigger table: github.com/jbloom/Huanan_…

If you want comparable data for all samples *and* all species, see this even bigger table: github.com/jbloom/Huanan_…
However, these tables get so big they are difficult to look at.

That is point of interactive scatter plots here: jbloom.github.io/Huanan_market_…

Choose any species, sampling date, etc & then see SARS2 vs mitochondrial content in one small plot & mouse over points for details.
I have posted updated version of preprint on bioRxiv:

This update includes addition of tables mentioned in three Tweets above, plus some revisions and responses to thoughtful comments by @flodebarre @acritschristoph detailed here:
biorxiv.org/content/10.110…
The final peer-reviewed version of my analysis of the environmental samples at the Huanan market is now published in Virus Evolution: academic.oup.com/ve/article/9/2…
I have performed additional new analyses comparing SARSCoV2 to other animal CoVs in environmental samples from Huanan Market

Paper w these new analyses is at doi.org/10.1093/ve/vea… & full computational pipeline at github.com/jbloom/Huanan_…

The new results are summarized below.
I first calculated total reads mapping to SARSCoV2 & other animal CoVs across all samples, & just samples collected on date of wildlife stall sampling (Jan-12-2020)

Six CoVs have >500 reads; of these 4 have many reads from Jan-12-2020 samples, but 2 have few reads from that date Image
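A minimal sketch of this tally: sum reads per virus across all samples, then again for a single collection date, and compare. The virus labels, dates, and read counts are invented; only the >500 read threshold comes from the text above.

```python
from collections import defaultdict


def total_reads_by_virus(records, date=None):
    """Sum reads per virus, optionally restricted to one collection date."""
    totals = defaultdict(int)
    for rec in records:
        if date is None or rec["date"] == date:
            totals[rec["virus"]] += rec["reads"]
    return dict(totals)


# Invented per-sample records for two hypothetical viruses.
records = [
    {"virus": "virus_A", "date": "2020-01-12", "reads": 600},
    {"virus": "virus_A", "date": "2020-01-01", "reads": 50},
    {"virus": "virus_B", "date": "2020-01-01", "reads": 900},
    {"virus": "virus_B", "date": "2020-01-12", "reads": 3},
]

all_totals = total_reads_by_virus(records)
jan12_totals = total_reads_by_virus(records, date="2020-01-12")

# Viruses passing the overall read threshold, then their reads on the chosen date.
abundant = sorted(v for v, n in all_totals.items() if n > 500)
for virus in abundant:
    print(virus, all_totals[virus], jan12_totals.get(virus, 0))
```

In this toy data both viruses pass the overall threshold, but only virus_A has most of its reads from the Jan-12-2020 samples, which is the kind of contrast described above.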
Specifically, bamboo rat CoV, two canine CoVs, & rabbit CoV have substantial reads in samples from Jan-12-2020 wildlife-stall sampling

SARSCoV2 & rat CoV have few reads from that date

See jbloom.github.io/Huanan_market_… for interactive plot w more options
Image
I next analyzed reads on per-sample basis. Below I just show results for Jan-12-2020 as that is when most samples w material from potentially susceptible animals (eg, raccoon dogs, bamboo rats) collected. Image
For bamboo rat, canine & rabbit CoVs there were samples w 100s viral reads. Samples w most viral reads had largest frac animal material from known hosts

But all Jan-12-2020 samples had few SARSCoV2 or rat CoV reads, & samples w most reads had little material from plausible hosts Image
To better explore the data in above plot, see jbloom.github.io/Huanan_market_… for an interactive plot that enables you to mouseover points for details on samples, select individual CoVs to display, and show additional dates.
I also plotted viral vs animal genetic content for Jan-12-2020 samples.

For 4 most abundant animal CoVs in these samples there is association of viral & host animal content, but not for much less abundant SARSCoV2 and rat CoV

(Also see interactive plot jbloom.github.io/Huanan_market_…)
Image
Overall, these results show that genetic material from some animal CoVs is fairly abundant in samples collected during the wildlife-stall sampling of the Huanan Market on Jan-12-2020. However, SARSCoV2 is not one of these CoVs.
For the animal CoVs with high abundance on Jan-12-2020, there are meaningful associations between the content of viral and animal genetic material. But there are not such associations for SARSCoV2 and less abundant viruses like the rat CoV Lucheng-19.
There remain significant caveats related to the underlying available data, as discussed in limitations section at end of my initial paper on this topic (academic.oup.com/ve/article/9/2…)
Image
Please see my full new paper at doi.org/10.1093/ve/vea… for more details, and interactive plots that allow you to explore the data in additional ways: jbloom.github.io/Huanan_market_…
Corrected link to interactive plots: jbloom.github.io/Huanan_market_…
