1/ A recent preprint (papers.ssrn.com/sol3/papers.cf…) reporting detection of sequence and antibody evidence for SARS-CoV-2 in Italy in the fall of 2019 presents results that are at odds with the current early SARS-CoV-2 timeline.
2/ It may be tempting to dismiss these results as false positives or some other data artifact (e.g.
), but should we do that with these “inconvenient” data?
3/ Or rather, should we think carefully about how to examine the “early European spread” hypothesis by seeking early data more systematically (as the preprint calls for) and considering which alternative models might fit the totality of available early data?
4/ While I usually agree with @MichaelWorobey and greatly respect his accomplishments in the “viral origins” domain, I do not agree with him in this case. I do think it’s absolutely essential to have an informed debate on these important issues.
5/ I am not a big fan of long Twitter threads (they seem like an anti-pattern for what the platform was designed for). I will keep this as brief as I can.
6/ First (and as also pointed out by @BallouxFrancois), there are now multiple independent reports on the presence of SARS-CoV-2 in Europe in 2019 and in the US earlier than the original estimates. These are based on many non-overlapping sources and done by different labs.
7/ Some of these studies have been questioned as possible false positives, but is it reasonable to assume that they ALL are false positives, especially when many studies took assiduous care to control false discovery?
8/ This places the preprint in the context of existing results which cumulatively provide stronger evidence than any of the individual studies.
9/ Any extent of SARS-CoV-2 community spread in Europe before December 2019 (regardless of what clade it was) is enough to force a major reassessment of the current early viral timeline.
10/ Second, the pattern of positives in the preprint is simply not what you would expect from contamination by a PCR positive control or from a false positive result of the (highly sensitive) nested PCR. If all the positives are simply amplifying a positive control contaminant, then:
11/ (i) why are there zero positives in the >90 “negative control” samples, subjected to the exact same protocol? (ii) why are there many different amplicon sequences retrieved (instead of identical copies of the contaminant)?
12/ (iii) why is there independent (protein-level) evidence of antibodies? The large number of negative samples, diversity in amplicon sequences, and detection of antibodies (different workstream, orthogonal technology) do not comport with the PCR contamination scenario.
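To put rough numbers on point (i), here is a minimal sketch. It assumes each sample independently picks up the contaminant with some fixed probability. The rates below are hypothetical and are not taken from the preprint, and 90 is only the lower bound on the number of negative controls mentioned above. The function name is mine.

```python
def prob_all_controls_negative(per_sample_rate: float, n_controls: int) -> float:
    """Chance that every control comes back negative, assuming each one
    independently turns (falsely) positive with probability per_sample_rate."""
    if not 0.0 <= per_sample_rate <= 1.0:
        raise ValueError("per_sample_rate must be between 0 and 1")
    if n_controls < 0:
        raise ValueError("n_controls must be non-negative")
    return (1.0 - per_sample_rate) ** n_controls


if __name__ == "__main__":
    N_CONTROLS = 90  # lower bound from the thread (">90" negative controls)
    # Hypothetical contamination rates, chosen purely for illustration.
    for rate in (0.01, 0.05, 0.10):
        p = prob_all_controls_negative(rate, N_CONTROLS)
        print(f"rate={rate:.2f}: P(all {N_CONTROLS} controls negative) = {p:.2e}")
```

Under this simple independence assumption, a contamination rate high enough to explain many positives among the test samples makes a completely clean set of controls very unlikely. Real contamination can be clustered by batch or plate, so this is only a back-of-the-envelope check.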
13/ Third, it is true that the detection of S614/RdRp323 variants in 2019 Italian samples is incompatible with a single ‘clade A’ source in late October/early November 2019. However, it also does not imply that the virus arose in Italy (or Europe) as
14/ the most obvious question then would be – where is the plausible zoonotic source for the introduction into Italy (I don't see one)? Due to the nature of the samples and the protocols, these are very short sequences and are not well-suited to a complete phylogenetic analysis
15/ (which is why all the preprint does is place them in the context of the global mutation order sequence). I do not think that these data can be directly used to resolve the contradiction, and more sequences (especially from archival samples in China) from 2019 are needed.
16/ Finally, I do not subscribe to the consequentialist school of thought when it comes to what should inform scientific inquiry.
17/ I do not think at all that the Italian (and other European) data suggest that the source is outside China (but rather that it was EARLIER in China), but even if they did, that would not be a reason to withhold reporting a comprehensively analyzed and unique dataset.
The analysis of recovered sequences does not fundamentally change our current understanding of early SARS-CoV-2 evolution, but it does make the hypothesis of a single-source wet market outbreak implausible.
The rooting of the tree (i.e. what the progenitor sequence is) is also more likely in clade A, i.e. the Wu-1 genome is not the ancestral genome; similar to what we find in academic.oup.com/mbe/advance-ar…, and
An update on #SARSCoV2 selection analysis using @GISAID data (observablehq.com/@spond/natural…). I added a simple 5-category classification for each potentially interesting site. One category = one point. The more points, the more interesting a site is.
Category 1. Is the site under selection using statistical comparative methods?
Category 2. Is there a large (>20%, which is incidentally what you can detect with mixed bases) fraction of minority alleles (synonymous or non-synonymous) among viral haplotypes at the site?
Category 3. Is there an upward trend over time in how many sequences carry a variant, i.e. do we see that variant frequency is increasing over time?
Category 4. Do we see multiple evolutionary events on the tree, i.e. more than one internal branch with selection?