All right folks, quick stats lesson. (Thread)

There's a meme floating around because of this article 👇 that "China must be faking their Coronavirus data because you never see R^2 = 0.99 with real data".

barrons.com/articles/china…
If you run a linear regression of cumulative deaths reported in China in the first half of February, and look at the residuals, you get a magic parabola like this: [Image: residual plot]
"Oh my gosh," you say to yourself. "FAKE DATA!!!" So you add in a quadratic term to the regression, and get R^2 = 0.9999 or something like that. You've cracked the code! It's a smoke signal from a Chinese researcher! Time to call Barron's!
Well, here's the problem. OLS regression requires independent observations. Cumulative data is NOT independent. Today's number is always yesterday's number, plus something. So the regression diagnostic numbers (R^2 etc) are all bunk.
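To see how badly cumulating inflates R^2, here is a minimal sketch in plain Python. It is an illustration with simulated numbers, not the China data: the daily counts are pure Poisson noise around a flat mean of 100 (an arbitrary choice), so there is no trend at all to find. The helper names `poisson` and `r_squared` are made up for this example.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def r_squared(xs, ys):
    """R^2 of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy ** 2) / (sxx * syy)

rng = random.Random(0)
days = list(range(1, 16))

# Simulated daily counts: independent noise around a flat mean, no trend.
daily = [poisson(100, rng) for _ in days]

# Running total: today's number is yesterday's number, plus something.
cumulative = []
total = 0
for d in daily:
    total += d
    cumulative.append(total)

r2_daily = r_squared(days, daily)
r2_cumulative = r_squared(days, cumulative)
print(f"daily R^2 = {r2_daily:.3f}, cumulative R^2 = {r2_cumulative:.4f}")
```

The same noise gives a near-perfect R^2 once it is cumulated, simply because a running total of positive numbers has to go up.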
Also: Cumulative sums of linear functions are... quadratic. (Gauss's formula etc) All you've really discovered is that the number of new deaths reported each day is increasing linearly. This is not, in itself, evidence of data fraud.
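A quick check of the Gauss's formula point, as a small Python sketch. The intercept `a = 10` and slope `b = 3` are hypothetical numbers picked for illustration. If daily counts are a + b*t, the running total after n days is a*n + b*n*(n+1)/2, which is quadratic in n.

```python
def cumulative_of_linear(a, b, n):
    """Running totals of daily counts a + b*t for t = 1..n."""
    total = 0
    out = []
    for t in range(1, n + 1):
        total += a + b * t
        out.append(total)
    return out

def closed_form(a, b, n):
    """Same total via Gauss's formula: a*n + b*n*(n+1)/2."""
    return a * n + b * n * (n + 1) // 2

a, b = 10, 3  # hypothetical intercept and slope
running = cumulative_of_linear(a, b, 15)

# The loop and the closed form agree on every day.
matches = all(running[n - 1] == closed_form(a, b, n) for n in range(1, 16))

# Constant second differences are the signature of a quadratic.
second_diffs = [running[i + 2] - 2 * running[i + 1] + running[i]
                for i in range(len(running) - 2)]
print(matches, set(second_diffs))
```

The second differences all equal the slope `b`, so a parabola in cumulative data is just a straight line in daily data.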
If you run a simple linear regression on NEW deaths reported (which are independent(ish) observations), the R^2 comes out around 0.98. Still high, but not as insane as advertised, and there might well be underlying system dynamics that would cause deaths to grow linearly.
So how can we tell if we're looking at #FAKEDATA from China? If data came from a Poisson process (i.e. they are independent events), it will have the telltale property that Variance = Mean. If the conditional variance is far less than the mean, there's likely something fishy here.
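The Variance = Mean property is easy to check by simulation. A minimal Python sketch, assuming a rate of 50 events per day (an arbitrary value) and a hand-rolled `poisson` sampler, since the standard library does not ship one:

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(1)
lam = 50  # hypothetical daily rate
draws = [poisson(lam, rng) for _ in range(20000)]

mean = sum(draws) / len(draws)
variance = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)
ratio = variance / mean  # should sit close to 1 for Poisson counts
print(f"mean = {mean:.2f}, variance = {variance:.2f}, ratio = {ratio:.3f}")
```

A ratio well below 1 in real count data is the warning sign the thread is talking about.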
In other words: If the reported data fits a linear function more tightly than simulated data ever could (bc of the variance inherent in Poisson processes), THEN it's time to call the data police.

The ratio of the conditional variance to the mean btw is called dispersion, which leads us to...
DISPERSION TESTING

Poisson processes have Variance = Mean, or a dispersion parameter of 1. Real-world count data is often "over-dispersed" with Variance > Mean, and a dispersion parameter > 1.

(Researchers btw usually use a negative binomial model for over-dispersed data.)
If the data has LESS variance than we'd expect from a random process (i.e. Variance < Mean), it's said to be under-dispersed. A good test of the data from China would be to test the null hypothesis that its dispersion parameter is >= 1. (Rejected null -> underdispersed data)
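The "tighter than simulated data ever could be" idea can be sketched as a Monte Carlo check in plain Python. To be clear about what this is: it is not the AER test, it uses a straight-line mean fitted by ordinary least squares, and the input series is made up (a line plus or minus 1) purely to show what a too-smooth series looks like. The function names are hypothetical, and the sketch assumes the fitted means stay positive.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def fit_line(ys):
    """Least-squares line through (t, y) for t = 1..n; returns fitted values."""
    n = len(ys)
    ts = range(1, n + 1)
    mt, my = (n + 1) / 2, sum(ys) / n
    sxx = sum((t - mt) ** 2 for t in ts)
    b = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sxx
    a = my - b * mt
    return [a + b * t for t in ts]

def dispersion(ys):
    """Pearson dispersion around the fitted line: sum((y-mu)^2/mu) / (n-2)."""
    mus = fit_line(ys)
    return sum((y - mu) ** 2 / mu for y, mu in zip(ys, mus)) / (len(ys) - 2)

def underdispersion_p_value(ys, n_sims=1000, seed=0):
    """Share of Poisson simulations at least as tight as the observed series."""
    rng = random.Random(seed)
    mus = fit_line(ys)
    observed = dispersion(ys)
    hits = 0
    for _ in range(n_sims):
        sim = [poisson(mu, rng) for mu in mus]
        if dispersion(sim) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_sims + 1)

# Made-up, suspiciously smooth series: a straight line plus or minus 1.
too_smooth = [60 + 5 * t + (1 if t % 2 else -1) for t in range(1, 16)]
observed, p_value = underdispersion_p_value(too_smooth)
print(f"dispersion = {observed:.3f}, one-sided p = {p_value:.4f}")
```

A small p-value says genuine Poisson noise almost never hugs the line this closely, which is the rejected-null, under-dispersed case.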
A DIY solution here is to fit the data to a Poisson regression, i.e. a Poisson generalized linear model (which is actually a poor fit bc with the usual log link it would predict an exponential, not linear, increase over time, but whatever) and then run the "dispersiontest" function from the "AER" R package.

rdrr.io/cran/AER/man/d…
Even with this clearly misspecified model (which would tend to produce larger residuals and hence extra dispersion), the dispersion parameter is estimated at 0.503, with a p-value (comparing the dispersion to 1) around 0.003, or 3 in 1000.
So it's quite possible the data is fake, or at least received some massaging, e.g. to never show a decrease.

Western researchers should demand China's microdata. This is harder to fake, and will let us construct risk-factor models, controlling for gender, smoking & comorbidity.
(Right now all we have are high-level cross-tabulations, which are very misleading because we don't know how much risk is due to age vs how much to pre-existing conditions, or how much to being male vs being a smoker)
Anyway, the lesson here is: R^2 is misleading and easy to misinterpret. If you get an R^2 of 0.9999, check your assumptions before blaming the data.

Also: please don't ever run time-series regressions on cumulative data.

And wash those hands. --Your neighborhood stats person
Keep Current with Evan Miller
