3: To explain, let me introduce you to 'lineage A' and 'lineage B', aka 'clade II' and 'clade I', respectively, in this paper by Zhang et al. These lineages co-circulated in China during the early days of the pandemic, and they differ at two key sites.
4: Lineage B genomes, such as the reference genome Wuhan/Hu-1, have a 'C' at position 8782 (genomes up to #546 in this alignment). Lineage A genomes have a 'T'.
And at position 28144, Lineage B has 'T' and lineage A has 'C'. At 8782/28144, then, B=C/T and A=T/C.
5: A relatively small number of early genomes, though, have C/C or T/T. They appear to be transitional, because if A evolved from B, or vice versa, you would see a C/C or T/T pattern after one of the two substitutions in that evolutionary journey had occurred.
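To make the A/B/transitional labels concrete, here is a minimal sketch of how a genome could be sorted by its bases at the two sites. It assumes the sequence is already aligned to Wuhan/Hu-1 so that string index = reference position - 1; the names `classify_lineage` and `toy_genome` are made up for illustration and the toy sequences are placeholders, not real genomes.

```python
# 1-based reference coordinates of the two sites that separate lineages A and B.
SITE_1 = 8782
SITE_2 = 28144


def classify_lineage(aligned_genome: str) -> str:
    """Label a reference-aligned genome from its bases at 8782 and 28144."""
    pair = (
        aligned_genome[SITE_1 - 1].upper(),
        aligned_genome[SITE_2 - 1].upper(),
    )
    if pair == ("C", "T"):
        return "B"
    if pair == ("T", "C"):
        return "A"
    if pair in {("C", "C"), ("T", "T")}:
        return "transitional"
    # Anything else (N, gap, other base) cannot be placed from these two sites.
    return "undetermined"


def toy_genome(base_8782: str, base_28144: str, length: int = 29_903) -> str:
    """Build a placeholder sequence of Ns with only the two key sites filled in."""
    seq = ["N"] * length
    seq[SITE_1 - 1] = base_8782
    seq[SITE_2 - 1] = base_28144
    return "".join(seq)


if __name__ == "__main__":
    for b1, b2 in [("C", "T"), ("T", "C"), ("C", "C"), ("T", "T"), ("N", "T")]:
        print(f"{b1}/{b2} ->", classify_lineage(toy_genome(b1, b2)))
```

A label of "transitional" here only describes the pattern at the two sites; it says nothing about whether the genome is real or an artefact.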
6: Why do we think they might all be erroneous?
Step back to the fading moments of life-as-you-used-to-know-it, when a man stepped off a plane from China at Sea-Tac airport, on Jan 15, 2020, then became the first in the US to be diagnosed with COVID-19.
7: His #SARSCoV2 genome was rapidly sequenced by CDC and was named 'WA1' - presumably for 'Washington case #1'. Then we waited...and waited...for the other shoe to drop: community spread in the US.
8: It dropped on Feb 29, 2020. In a thread that sent shock-waves through the scientific community, Trevor Bedford at the Hutch reported that a second genome had been recovered from a community-acquired case, and it was eerily close to WA1's.
9: Indeed, it appeared that WA2 had descended directly from WA1. This implied that cryptic community transmission had already been happening for *6 weeks* in the US. Not good! (Though note that Trevor pointed out that the close relationship *could* be a coincidence).
This important work would go on to be published in Science.
11: I was initially deeply convinced by this argument. But the more I got the feel for how this virus evolves, the more I thought it was strange that all the genomes in WA State from Feb and beyond had two key substitutions away from WA1.
12: At positions 17747 and 17858, WA1 was C/A and WA2 and all other 'WA outbreak' genomes in Feb, Mar and beyond were T/G. Why the clean separation? If WA1 really did kick things off, why weren't we seeing genomes identical, or at least closer, to it?
13: I knew what was needed: to re-run the epidemic in WA State over and over again to see if the clean, two-nucleotide difference would be observed if WA1 really *had* started the US outbreak.
14: I started looking through the literature for software that would allow one to simulate the Washington State outbreak under realistic epidemiological parameters, then evolve #SARSCoV2 genomes through the infectees.
Came across a package called FAVITES.
15: To my pleasant surprise, one of the co-authors was my former PhD student, Joel Wertheim. Teamed up with the brilliant Joel, @pekar, @suchard_group and @LemeyLab and showed that WA1 was deeply unlikely to have started the outbreak.
16: The pattern observed in WA State, which is very similar to the lineage A and lineage B pattern in China in early 2020, just didn't seem consistent with a 'single introduction' scenario. Instead, similar viruses seem to have jumped in twice.
17: We published these findings in the same issue of Science as Trevor's paper:
18: But there had been one big mystery. Transitional genomes, with only one of the two diagnostic substitutions, *had* been reported from neighbouring British Columbia (BC). These C/G genomes seemed to undermine the 'two introduction' model (one for WA1, one for WA2).
19: These genomes vexed the hell out of me for weeks.
I suspected they might contain an error at one of the two key sites, but wasn't sure. If they *were* real, then maybe WA1 *did* start off the whole outbreak. Did he travel through Vancouver and infect people there?
20: Finally, it dawned on me that the genomes themselves might give up the secret of whether they were real or just artefacts due to sequencing issues.
Did the transitional C/G genomes share substitutions (other than the two already mentioned) with T/G genomes like WA2?
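The check described here boils down to a set intersection. Below is a rough sketch, assuming each genome is summarised as a set of substitutions written like "C35T" (reference base, position, new base). Everything in it is hypothetical: the function names, the toy diagnostic positions 10 and 20, and the invented substitution lists, which stand in for real data and are not taken from the paper.

```python
def position(substitution: str) -> int:
    """Extract the position from a substitution string such as 'C35T'."""
    return int(substitution[1:-1])


def shared_non_diagnostic(subs_a, subs_b, diagnostic_sites):
    """Substitutions two genomes share, ignoring the diagnostic sites themselves."""
    shared = set(subs_a) & set(subs_b)
    return {s for s in shared if position(s) not in diagnostic_sites}


if __name__ == "__main__":
    # Toy example: two diagnostic sites at made-up positions 10 and 20.
    diagnostic = {10, 20}
    # 'Transitional' genome: carries only one of the two diagnostic changes.
    transitional = {"A20G", "C35T", "G42A", "T58C", "A77G"}
    # Genome with both diagnostic changes plus the same four extras.
    derived = {"C10T", "A20G", "C35T", "G42A", "T58C", "A77G"}

    extras = shared_non_diagnostic(transitional, derived, diagnostic)
    print(sorted(extras, key=position))
    print(len(extras), "shared substitutions outside the diagnostic sites")
```

If a supposedly transitional genome shares several such extra substitutions with fully derived genomes, an error at a diagnostic site becomes a simpler explanation than the same changes arising twice.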
21: Check out this figure from our paper. See how the BC genome at the very top shares 4 substitutions with the one at the very bottom? If the top, transitional, one were real, that would mean those very same 4 substitutions had happened independently in the bottom genome.
23: That is like two people, each with their own deck of cards, drawing 4 cards each and finding out that they drew exactly the same cards. Even with a deck of 52 cards, that is near impossible.
But the SARS-CoV-2 deck has ~30,000 cards (one for each nucleotide).
24: It is simply not something likely to happen by chance: it's the 'transitional' virus's 'tell' that it's not transitional after all. It just has an error at site 17747 and is two substitutions different from WA1 after all, like the other Feb-Mar genomes.
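The card analogy can be put into rough numbers. This is a back-of-envelope sketch, assuming every card (or site) is equally likely to be drawn, which is a simplification: real SARS-CoV-2 substitutions are not uniform across sites, so treat it as an illustration of the analogy rather than the analysis in the paper. The function name `chance_of_identical_draw` is made up for the example.

```python
from math import comb


def chance_of_identical_draw(deck_size: int, cards_drawn: int) -> float:
    """Chance that a second player draws exactly the same set as the first.

    Under uniform, independent draws there are comb(deck_size, cards_drawn)
    equally likely sets, and only one of them matches.
    """
    return 1 / comb(deck_size, cards_drawn)


if __name__ == "__main__":
    cards = chance_of_identical_draw(52, 4)
    genome = chance_of_identical_draw(30_000, 4)
    print(f"52-card deck, 4 cards:     1 in {comb(52, 4):,}")
    print(f"~30,000-site genome, 4:    1 in {comb(30_000, 4):.2e}")
    print(f"ratio (deck vs genome):    {cards / genome:.2e}")
```

With a 52-card deck the match is already a 1-in-270,725 event; with ~30,000 positions the number of possible 4-site sets is on the order of 10^16.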
25: Leaving us with the strong conclusion that WA1 and WA2 had separate introductions into the US (with the WA outbreak introduction, incidentally, happening a bit later, around Feb 1).
26: Which, at long last (sorry!) brings me back to the supposedly transitional genomes between lineage A and lineage B in China. We see a pattern very similar to the BC (and other) likely-artefactual transitional genomes:
They share substitutions with 'pure' A and B genomes.
27: So some of these are virtually *certain* to be erroneous for one reason or another, and we believe it is unlikely that there are *any* real transitional genomes between A and B.
28: Why is this crucial to understanding how the pandemic started?
If there really are just pure A and B viruses from early in China, the WA1 story teaches us that that *might* be because each lineage had a separate origin from an animal to a human.
29: I'll be the first to admit that I thought the idea of multiple introductions was bonkers when I first encountered it. But we now know that animals like civets and raccoon dogs were present in Wuhan wet markets, with shared supply chains.
30: If a human-transmissible SARS-CoV-2 progenitor was circulating among such animals, the SARS1 story teaches us that it would be likely to jump multiple times into humans.
31: The possible lack of any real A/B-transitional genomes makes me take the two-intro model much, much more seriously. It's by no means settled, but we have developed the technology to test that hypothesis...we'll see.
32: Important: if lineage A and lineage B had separate origins, then the time their common ancestors existed, and when the 'patient zero' of each lineage was infected, might be considerably later than estimates assuming a single jump, like ours:
1: I want to follow up the thread below with some additional clarification of why we hypothesize that there may be no real #SARSCoV2 genomes transitional between lineages A and B.
2: @daoyu15 has written a thread asserting that we "toss any genomes that don't fit your conclusions away". I'm afraid this is incorrect on multiple counts.
3: What we show is that many of the putatively transitional genomes bear obvious evidence of being artefacts - probably due to bioinformatic pipelines, rather than sequencing errors per se. (Issues like calling a site with poor coverage to be the base of a reference genome.)
[Worobey] would like to see the scientific and intelligence communities collaborate on the problem. "I would hope and assume that this 90-day sprint is going to turn into a nice long jog where there could be some back-and-forth."
3/4 Crucial point US IC elements agree on:
"China’s officials did not have foreknowledge of the virus before the initial outbreak of COVID-19 emerged".
So could we *please* collectively move on from claim that WIV database removal in Sept 2019 was part of a cover-up/conspiracy?
The study, led by Dr. Elisabetta Tanzi, also includes heavy-hitters of molecular evolution @sergeilkp and Sudhir Kumar. I greatly admire both but respectfully disagree with their conclusions here and feel it is important to explain why. 2/
Dr. Tanzi led an earlier study claiming to find evidence of SARS-CoV-2 in a boy in Northern Italy who presented with measles symptoms in Nov 2019. 3/
Here I explain why I (continue to) think that a zoonotic origin of SARS-CoV-2 is more likely than a lab leak scenario - even though I signed 'The Science Letter'. 1/