Are you interested in:
Causal inference? Data best practices? Time series? COVID-19? Awkward methods drama?
Then check out:
"How to be a Curious Consumer of COVID-19 Modeling: 7 Data Science Lessons from ... (Feyman et al. 2020)"
rexdouglass.github.io/Douglass_2024_…
Feyman et al. (2020) ask whether COVID-19 shelter-in-place orders actually kept people inside or if they were ignored and people would have stayed inside anyway.
It's a very hard and very important question!
They think they found an answer! They looked at Google's Mobility Report time series measures of how many people visited work, transit, retail, grocery, and parks. They report finding a discontinuity in the time series right about when SIP orders were passed: -12%!
They argue that the estimated discontinuity represents a causal effect. They borrow from regression discontinuity designs, where a cut point arbitrarily assigns units to treatment or control within some ball. They apply it to time series: regression discontinuity in time (RDiT).
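For intuition, here's roughly what an RDiT estimator boils down to. This is a minimal sketch with made-up column names ('days_to_order', 'mobility'), not the authors' code:

```python
# Minimal RDiT sketch (my reconstruction, not the authors' code).
# Assumes a toy DataFrame with hypothetical columns:
#   days_to_order : integer days relative to the SIP order (0 = order date)
#   mobility      : % change from baseline reported by Google
import pandas as pd
import statsmodels.formula.api as smf

def rdit_estimate(df: pd.DataFrame, bandwidth: int = 14) -> float:
    """Local-linear RDiT: fit a trend on each side of the cutoff within a
    narrow window and read the jump off the 'post' coefficient."""
    window = df[df["days_to_order"].abs() <= bandwidth].copy()
    window["post"] = (window["days_to_order"] >= 0).astype(int)
    # post * days_to_order expands to post + days_to_order + their interaction,
    # i.e., separate intercepts and slopes before/after the order date.
    fit = smf.ols("mobility ~ post * days_to_order", data=window).fit()
    return fit.params["post"]  # the estimated discontinuity at the cutoff
```

The rest of this thread is about whether that 'post' coefficient means anything here.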
Does their identification strategy allow us to interpret that estimate as causal? It does not! The very difficult assumptions of an RDiT design do not hold for the apartment-building fire that was the first few weeks of a DEADLY PANDEMIC in the U.S. In fact, it's a terrible case for one.
Does it at least show an association between SIPs and a -12% drop in mobility right after? Also no!
3 different artifacts produce the result:
1. Clipping 2 days out of existence in the time series
2. Averaging 5 pre-normalized trends that shouldn't be averaged
3. Using the data in ways its producers explicitly warned against
There's a lot to learn from this paper! I've distilled 7 lessons that apply broadly to every observational COVID paper I've seen and to every project I've ever been on. Let's go through them one by one.
Lesson 1: You cannot stare at a correlation so hard that it becomes causal
The idea of RDiT is that if you fully specify the data generating process (DGP) of an outcome Y, and the DGP of the timing (T) of a treatment (D), you might find a clever window where the two are unrelated.
The modeler has to show in detail how a small window disentangles the DGPs of T and Y.
The DGPs for mobility and for SIP announcements are never specified, but they're unlikely to be separable! An unobservable called the DEADLY PANDEMIC was the biggest cause of both mobility and SIPs.
No matter how small you crank that window, it doesn't disentangle two incredibly complicated and highly related DGPs for you. Or rather, it doesn't do it automatically. The modeler then has to spend pages grinding through all the paths and applying strong priors to close them. It's not free.
It's conceptually tough for SIPs, which were the last act of states shutting down. Do the credits rolling at the end of a movie ‘cause’ the movie to end? Does being buried in a cemetery ‘cause’ you to die? You’ll get a very sharp discontinuity in mobility, guaranteed.
Lesson 2: Make sure the modeling is fully computationally reproducible
The code doesn't match the paper in an important way. They completely remove two time steps (3 days) from the data before running the analysis, instead of setting them to NA and letting the model interpolate.
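To make the difference concrete, here's a sketch of the two choices, using a hypothetical daily series indexed by date (the dates and names are placeholders, not theirs):

```python
import pandas as pd

def clip_days(s: pd.Series, drop_dates) -> pd.Series:
    # What the code does: delete the rows outright, so the day before the gap
    # sits directly next to the day after it and the model never sees the hole.
    return s.drop(pd.to_datetime(drop_dates))

def mask_and_interpolate(s: pd.Series, drop_dates) -> pd.Series:
    # The alternative: keep the calendar, mark the days as missing, and
    # interpolate in real time so no span of time silently vanishes.
    out = s.copy()
    out.loc[pd.to_datetime(drop_dates)] = float("nan")
    return out.interpolate(method="time")
```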
Lesson 3: Don’t make physically impossible comparisons
Removing literal days of space-time has the mechanical consequence of creating a gap in smooth processes. A falling ball will suddenly teleport. They interpret that teleportation as the effect of an event that happened in the missing time.
If you don't Twilight Zone away those 3 days and instead let the model interpolate in real time, the gap shrinks or goes away. Under different choices of cut points:
-0.6% [<=-2,>=1] no anticipation days
-1.9% [<0,>=0] sharp RDD
-3.7% [<=-1,>=1] their code
-6.5% [<=-1,>=1] their paper
Lesson 4: Make sure the outcome measures what you say it does
Google explicitly warns not to make day-to-day comparisons.
The data come pre-normalized: relative mobility on a given day of the week against a January baseline.
So Monday vs. Tuesday and weekday vs. weekend comparisons generate different % shifts.
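A toy example (all numbers invented) of why values on different days of the week aren't directly comparable:

```python
# Each reported value is "% change vs. the January baseline for that same day
# of the week", so every weekday has its own denominator.
baseline_visits = {"Sun": 40, "Mon": 100}  # hypothetical January medians
current_visits = {"Sun": 40, "Mon": 80}    # a hypothetical later week

pct_change = {day: 100 * (current_visits[day] - baseline_visits[day]) / baseline_visits[day]
              for day in baseline_visits}
print(pct_change)  # {'Sun': 0.0, 'Mon': -20.0}
# The series shows a 20-point Sunday-to-Monday "drop" even though raw Monday
# visits (80) are still double Sunday's (40): the baseline changed, not behavior.
```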
Lesson 5: Don’t ignore systematic missingness
The main outcome is the mean over 5 mobility types. If any of those 5 types drops in and out over time, it mechanically shifts the outcome. This happened constantly across their panel, with huge systematic dropouts everywhere.
Up to a third of the counties also regularly disappeared from day to day. This is another way their strategy of shrinking the window makes things worse: the sample frame radically changes from just before a cut to just after.
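A toy illustration (numbers invented) of how a category dropping out mechanically moves the composite mean even when nothing behavioral changes between the two days:

```python
import numpy as np

day_before = {"work": -5, "transit": -5, "retail": -5, "grocery": -5, "parks": 30}
day_after = {"work": -5, "transit": -5, "retail": -5, "grocery": -5}  # parks missing

print(np.mean(list(day_before.values())))  # 2.0
print(np.mean(list(day_after.values())))   # -5.0: a 7-point "drop" from missingness alone
```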
Lesson 6: Don’t aggregate things that shouldn’t be aggregated
The 5 mobility types are pre-normalized. Averaging them together doesn't cancel i.i.d. noise; it smooshes 5 different DGPs into an abomination that reflects none of them. It also weights all Park-going equal to all Work.
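A quick sketch of the weighting problem (the visit shares below are made-up placeholders, not Google's numbers):

```python
# Under an unweighted mean, a +50 point swing in Parks (the smallest, noisiest
# category) moves the composite exactly as much as a +50 point swing in Work.
categories = ["work", "transit", "retail", "grocery", "parks"]
equal_weight = 1 / len(categories)

parks_swing = 50  # a +50 point jump in Parks only; everything else flat
print(equal_weight * parks_swing)  # 10.0 points added to the composite outcome

# Under a hypothetical visit-share weighting (invented shares), the same swing
# barely registers:
visit_share = {"work": 0.40, "transit": 0.15, "retail": 0.20, "grocery": 0.20, "parks": 0.05}
print(visit_share["parks"] * parks_swing)  # 2.5 points
```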
Disaggregating and removing the Twilight Zone skip:
- No discontinuity at all in Work and Transit
- No discontinuity in Grocery and Retail if you remove the 2 days of anticipation panic buying
- Only maybe a discontinuity in Park-going, the noisiest and smallest activity
Lesson 7: Verify your result with case studies
They present South Carolina as a null case with no discontinuity. When we look at the individual time series, we see lots of jerking around but no discontinuities. Plot the other policy steps that passed, and we see it was an entire month of treatments.
They present New York as a poster child for their result: a huge -32% causal effect.
It doesn't exist in the disaggregated data. Instead, a sawtooth pattern in Work and many huge +50%/-50% swings in Parks right at their treatment date create the hallucination of a huge discontinuity overall.
Same story for their next strongest case. The huge, precise -28% causal effect in Massachusetts is due entirely to Parks dropping from +100% to 0, with a small sawtooth pattern in Work. They interpreted these huge effects in Democratic states as treatment heterogeneity from politics. They didn't exist.
Conclusion: The burden is always on the modeler, never the reader
This isn't the only observational COVID paper to do this. It's not the 10th. Not the 100th. Nearly all of them do terrible data due diligence and measurement in order to sell a cute identification strategy.
Modeling is a never-ending, thankless checklist of things that must all be true or none of it is true. It is a mathematical proof. It is the modeler's job to get as far down that checklist as possible and make clear what's left undone. It's not the reader's job to guess. Do your job.
I wish to sincerely thank the authors for inviting me to review their paper. If you like this review, it has a companion from the start of the pandemic when Richard Epstein famously claimed the pandemic would fizzle out and intervention wasn't needed. rexdouglass.github.io/TIGR/Douglass_…
I need to go back and cite this in my discussion of how policies aren't randomly timed with respect to every other covariate in the middle of a global disaster.
papers.ssrn.com/sol3/papers.cf…