This thread addresses the claim in Worobey et al that KDE analysis shows centering of Dec 2019 COVID case-residences on the Huanan Market.
science.org/doi/10.1126/sc…
This apparent centering is an artifact of using an overly large bandwidth in the KDE calculation. Image
Adjusting the bandwidth parameter to more realistic values shifts the center of the KDE pattern away from the Huanan Market, to a neighbourhood north of the market where there truly is a significant cluster of case-residences. Image
Worobey et al base their interpretation on simplified KDE maps in which the influence of each data point is smeared over a large area (oversmoothing).
The pattern is centered on the Huanan Market, but this is simply an artifact of oversmoothing. Image
Worobey et al place significant interpretational weight on the oversmoothed KDE maps, and particularly the apparent centering on the Huanan Market.
The oversmoothed KDE and its 1% probability contour are emphasized in Fig 1b and c and in the associated text, with the additional claim that a similar centering is shown by the KDE for the subset of 120 cases which were epidemiologically unlinked to the Huanan Market. Image
The oversmoothed KDE and its 1% contour are further emphasized in the various tweets by the authors…
KDE (Kernel Density Estimation) is a non-parametric method for estimating the probability density function of univariate or multivariate data. It is used in GIS to generate heat maps.
en.wikipedia.org/wiki/Kernel_de…
It works by placing a kernel function over each data point, and summing those functions to get the KDE.
Easiest to visualize in 1D.
Data: black bars.
Kernel function: red dashed line.
KDE: blue line.
Credit: en.wikipedia.org/wiki/User:Drle… Image
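A minimal sketch of the "place a kernel on each point and sum" idea in 1D, using a Gaussian kernel. The data values and the bandwidth of 1.5 are illustrative assumptions, not taken from the figure or from Worobey et al.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_1d(x, data, bandwidth):
    """KDE at x: the average of kernels centred on each data point,
    each stretched horizontally by the bandwidth."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Illustrative data points (assumed values, the "black bars")
data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

# Evaluate the KDE (the "blue line") on a regular grid
step = 0.05
grid = [i * step for i in range(-200, 300)]
density = [kde_1d(x, data, bandwidth=1.5) for x in grid]

# A KDE is a probability density, so it should integrate to about 1
area = sum(density) * step
print(f"area under the KDE: {area:.3f}")
```

Each data point contributes one bump of equal area; the bandwidth only controls how wide and flat that bump is.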
The KDE is:
▪️not sensitive to the *shape* of the kernel function
▪️very sensitive to the *area of influence* of the kernel function (the bandwidth parameter).
Over-large bandwidths result in oversmoothing.
en.wikipedia.org/wiki/Kernel_de…
en.wikipedia.org/wiki/User:Drle… ImageImage
Here’s an animation to illustrate the influence of bandwidth on KDE in 1D
kdepy.readthedocs.io/en/latest/band…
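To make the bandwidth effect concrete, here is a small sketch that counts the peaks of a 1D Gaussian KDE at two bandwidths. The two-cluster data and both bandwidth values are made up for illustration.

```python
import math

def kde_1d(x, data, h):
    """1D Gaussian KDE at x with bandwidth (kernel standard deviation) h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

def count_modes(data, h, lo, hi, steps=2000):
    """Count the local maxima of the KDE on a regular grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [kde_1d(x, data, h) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Two well-separated synthetic clusters (assumed values)
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 2.9, 3.0, 3.1]

modes_small = count_modes(data, 0.5, -3.0, 6.0)  # small bandwidth
modes_large = count_modes(data, 3.0, -3.0, 6.0)  # large bandwidth
print(modes_small, modes_large)
```

With the small bandwidth the two clusters remain separate peaks; with the large bandwidth they merge into a single peak, which is the oversmoothing described above.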
2D mapping example of undersmoothed and oversmoothed KDEs for a study of wildfire ignition points.
These maps illustrate a key aspect of generating KDE maps: the need to select a just-right value for the bandwidth.
aloki.hu/pdf/1604_47014… Image
Worobey et al (2022) used the ‘kde’ function in the ‘ks’ package in R to generate their KDEs.
cran.r-project.org/web/packages/k…
They used the default bandwidth option in ‘kde’, yielding oversmoothed KDEs which ignore local clusters and encompass large areas without data points. Image
What would the pattern look like if not oversmoothed?
Here are the KDE contours on the full dataset of 155 case-residences (linked and unlinked to the Huanan market), calculated with ‘kde’, using a reduced bandwidth (obtained by dividing the H bandwidth-matrix by 20). Image
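A note on the arithmetic of "dividing H by 20", sketched under the assumption that the bandwidth matrix H acts as the variance-covariance matrix of the Gaussian kernel (my understanding of the ‘ks’ convention). Under that assumption the kernel's linear spread shrinks by sqrt(20), about 4.5 times, and not by 20 times. The H values below are hypothetical, not the matrix from the paper.

```python
import math

# Hypothetical 2x2 bandwidth matrix (assumed values, arbitrary units squared)
H = [[4.0, 0.6],
     [0.6, 2.5]]

# The reduction used in the thread: every element divided by 20
H_reduced = [[value / 20.0 for value in row] for row in H]

# If H is a covariance matrix, the 1-sigma spread along x is sqrt(H[0][0])
sigma_x = math.sqrt(H[0][0])
sigma_x_reduced = math.sqrt(H_reduced[0][0])

ratio = sigma_x_reduced / sigma_x
print(f"linear bandwidth ratio: {ratio:.4f}")  # 1/sqrt(20), about 0.2236
```

The footprint area of each kernel, on the other hand, does scale by 1/20, since area goes with the square of the linear spread.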
Side-by-side comparison of oversmoothed and reduced-bandwidth versions.
In the reduced-bandwidth version, local patterns are preserved, and the contours do not enclose large areas without data points. ImageImage
Zooming in and comparing with the default oversmoothed 1% curve (dashed line), one can see that the oversmoothed 1% curve is an artifact of the high bandwidth: it spatially averages the large cluster to the north of the market with a smaller cluster to the south of the market. Image
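The averaging effect can be reproduced on synthetic data. This sketch uses an isotropic Gaussian kernel and invented coordinates (a larger cluster near y = 2.0 and a smaller one near y = -1.5); it is not the Worobey et al data and not the ‘ks’ algorithm.

```python
import math

def kde_2d(x, y, points, h):
    """Isotropic 2D Gaussian KDE at (x, y); h is the kernel standard deviation."""
    norm = len(points) * 2.0 * math.pi * h * h
    return sum(math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * h * h))
               for px, py in points) / norm

def kde_peak(points, h, step=0.1):
    """Grid search for the location of the KDE maximum."""
    grid = [(ix * step, iy * step) for ix in range(-30, 31) for iy in range(-40, 41)]
    return max(grid, key=lambda p: kde_2d(p[0], p[1], points, h))

# Synthetic clusters (assumed coordinates, arbitrary units)
north = [(0.0, 2.0), (0.2, 2.1), (-0.2, 1.9), (0.1, 1.8), (-0.1, 2.2), (0.0, 2.0)]
south = [(0.0, -1.5), (0.2, -1.4), (-0.2, -1.6)]
points = north + south

peak_small = kde_peak(points, h=0.4)  # reduced bandwidth
peak_large = kde_peak(points, h=3.0)  # oversmoothed
print("small bandwidth peak:", peak_small)
print("large bandwidth peak:", peak_large)
```

With the small bandwidth the peak sits on the northern cluster. With the large bandwidth the peak moves to a point between the two clusters where there are no data points at all.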
We see similar results if we consider only the cases which were epidemiologically unlinked to the Huanan Market.
The oversmoothed version from Worobey et al… Image
…and reduced-bandwidth version (H matrix divided by 20)… Image
Side-by-side comparison of oversmoothed and reduced-bandwidth versions.
Again, in the reduced-bandwidth version, local patterns are preserved, and the contours do not enclose large areas without data points. ImageImage
Zooming in, and comparing to the oversmoothed 1% curve, one sees the same pattern as with linked+unlinked cases: the oversmoothed 1% curve is an artifact, a spatial averaging of the large cluster to the north of the market with a smaller cluster to the south of the market. Image
Conclusion:
Worobey et al rely on oversmoothed KDEs which provide a misleading representation of the spatial distribution of Dec 2019 case-residences.
This is not a valid spatial analysis, and does not support the contention that the Huanan Market is the origin point of COVID.
Addendum: it can be useful to look at just the dots, without the distraction of contours Image
Image
Image
Caveat: case-residence locations shown in this thread are as extracted by Worobey et al from low-res maps in who.int/publications/i…
and include locational error:
🌐error in the original data and plotting of the low-res maps
🌐error in coordinate extraction by Worobey et al ImageImage
Addendum: Bandwidth and Area of Influence
In a KDE, the area of influence around each point is controlled by the kernel function and the bandwidth.
R package ‘ks’ uses the Gaussian kernel function, for which the bandwidth corresponds to 1 standard deviation (1 sigma).
Image: ncbi.nlm.nih.gov/pmc/articles/P… Image
For the Gaussian kernel, the area of influence extends out to at least 2 sigma.
For a univariate Gaussian, you get ~68% and ~95% under 1 and 2 sigma, respectively.
But for a bivariate Gaussian, you get ~39% and ~86%.
(Table 1, ncbi.nlm.nih.gov/pmc/articles/P…) Image
So, for the area of influence that contributes to a KDE around a given location, when using the Gaussian kernel, we should consider at least 2 sigma, i.e., 2 times the bandwidth (which is just 1 sigma). And that still leaves 14% poking out around the edges. Image
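The percentages quoted above can be checked with the standard closed-form expressions: erf(k/sqrt(2)) for the univariate case, and 1 - exp(-k²/2) for the bivariate case (probability inside the k-sigma ellipse). The function names are my own.

```python
import math

def coverage_1d(k):
    """Probability mass of a univariate Gaussian within +/- k sigma."""
    return math.erf(k / math.sqrt(2.0))

def coverage_2d(k):
    """Probability mass of a bivariate Gaussian inside its k-sigma ellipse."""
    return 1.0 - math.exp(-0.5 * k * k)

for k in (1, 2):
    print(f"{k} sigma: 1D {coverage_1d(k):.1%}, 2D {coverage_2d(k):.1%}")
# 1 sigma: 1D 68.3%, 2D 39.3%
# 2 sigma: 1D 95.4%, 2D 86.5%
```

The remaining mass outside 2 sigma in 2D is exp(-2), about 13.5%, which is the roughly 14% mentioned above.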
The 2D Gaussian can have an elliptical footprint, and this can be rotated away from the coordinate axes, to accommodate diagonal trends.
The 2D Gaussian calculated by ‘ks’ is generally elliptical and rotated.

Image credit: commons.wikimedia.org/w/index.php?ti… Image
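A sketch of how the elliptical footprint follows from a 2x2 bandwidth matrix, assuming the matrix is the covariance of the Gaussian kernel. The semi-axes come from the eigenvalues and the rotation from the off-diagonal term. The example matrix is hypothetical.

```python
import math

def sigma_ellipse(H, k=2.0):
    """Semi-major axis, semi-minor axis and rotation (degrees, counterclockwise
    from the x axis) of the k-sigma ellipse of a symmetric 2x2 covariance H."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    mean = 0.5 * (a + d)
    spread = math.hypot(0.5 * (a - d), b)
    lam_major, lam_minor = mean + spread, mean - spread  # eigenvalues of H
    angle = 0.5 * math.degrees(math.atan2(2.0 * b, a - d))
    return k * math.sqrt(lam_major), k * math.sqrt(lam_minor), angle

# Hypothetical bandwidth matrix with a strong diagonal trend
H = [[2.5, 1.5],
     [1.5, 2.5]]
major, minor, angle = sigma_ellipse(H)
print(f"2-sigma semi-axes: {major:.2f} and {minor:.2f}, rotated {angle:.1f} degrees")
```

When the off-diagonal term is small relative to the difference between the diagonal terms, the rotation angle is close to zero and the ellipse is nearly aligned with the coordinate axes.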
The 2-sigma areas of influence for the linked+unlinked KDE (Fig 1b in Worobey et al), using the default bandwidth matrix. (Rotation is minor and is ignored for this purpose.) Image
The 2-sigma areas of influence for the linked+unlinked KDE (Fig 1b in Worobey et al), using the default bandwidth matrix divided by 20. (Rotation ignored.) Image
Side-by-side comparison of areas of influence for default bandwidth matrix as used in Worobey et al Fig 1b (left) and default bandwidth matrix divided by 20 (right). ImageImage
To address a subtweet criticism by @Samuel_Gregson, namely that reducing the bandwidth amounts to “overfitting noise”:
If one considers the KDE as purely data visualization, then overfitting is not applicable.
If one considers the KDE to have predictive power, then overfitting is an issue.
KDE probability contours with the default bandwidth/20 have a weird shape compared to the smooth contours generated by the default bandwidth.
Is this weird shape simply overfitting that would interfere with use of the KDE to predict where unknown case-residences might be found? Image
The weird shape of the reduced-bandwidth KDE is in fact a good match to population density. It excludes areas of low population density.
As such, it would be a better predictor of where one would expect to find unknown case-residences, as compared to the default-bandwidth KDE. Image
Unsurprisingly, the case-residences largely correspond to areas of high population density, and are sparse or absent in areas of low population density.
The default-bandwidth KDE underfits the data: it would predict case-residences in areas of low population density. Image
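One generic way to examine an overfitting or underfitting claim, not something done in this thread, is leave-one-out log-likelihood: score each bandwidth by how well the KDE built from all other points predicts each held-out point. A 1D sketch on synthetic two-cluster data (all values assumed):

```python
import math

def loo_log_likelihood(data, h):
    """Sum over points of the log KDE density at that point,
    with the KDE built from all the other points (Gaussian kernel)."""
    n = len(data)
    norm = (n - 1) * h * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, xi in enumerate(data):
        s = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2)
                for j, xj in enumerate(data) if j != i)
        total += math.log(s / norm)
    return total

# Two synthetic clusters (assumed values)
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 2.9, 3.0, 3.1]

candidates = (0.01, 0.1, 0.5, 3.0, 10.0)
scores = {h: loo_log_likelihood(data, h) for h in candidates}
best = max(scores, key=scores.get)
for h in candidates:
    print(f"h = {h:5.2f}  leave-one-out log-likelihood = {scores[h]:8.2f}")
print("best candidate:", best)
```

On this toy data both extremes score poorly: the tiniest bandwidth overfits (each held-out point is far from its neighbours in sigma units), while the largest ones underfit by spreading density over empty space.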
In R package ‘ks’
cran.r-project.org/web/packages/k…
it is possible to use bandwidth algorithms other than the default Hpi used by Worobey et al.
The Hnm bandwidth yields a pattern similar to Hpi divided by 20.
This may be because the Hnm algorithm is tuned to recognize clusters of data. Image
Self-critique: this map and interpretation ⬇️ will have to be revised, as there are issues with the Worldpop constrained dataset used for the population density layer of the map…
…for details on the issues with the Worldpop data, refer to this and subsequent tweets (in a separate thread).
