New project on causal language and claims, and I want you to see how everything goes down live, to a mind-boggling level of transparency.
That includes live public links to all the major documents as they are being written, live discussion on major decisions, etc.
Worth noting: this is the second time I've tried this kind of public transparency; the previous paper got canned due to COVID-related things.
Here's the idea (at the moment, anyway): health research has a very complicated relationship with "causal" language.
There is a semi-ubiquitous standard that if your study isn't the right method or isn't "good enough" for causal estimation, you shouldn't use the word cause, but instead just say things are associated/correlated/whatever, and you're good to go.
This is ... problematic.
Lots of potential rants here for why, but suffice to say this standard creates all kinds of issues with study strength, communication, and usefulness. This is a problem I've been working on for years.
But how common is this, and is it a big problem?
One symptom of this disconnect is that a lot of papers may make a lot of action claims about their studies that would require causal estimation, but according to the language, are "just association."
So, what do we want to know?
1) How do *typical* journal publications phrase the relationships between exposures and outcomes?
What actual words are used (correlate, effect on, associate, cause, etc)?
With what modifiers (may be, strongly, etc)?
How common is "just say association"?
2) Do the claims and action implications made in the paper imply or require a causal estimate?
E.g. do the implications from the paper suggest doing more, less, or the same amount of X in order to increase or decrease Y?
That's implicitly causal in nature.
So, the plan:
1) Take a giant randomly selected, screened sample of X vs Y-type articles in the health literature.
2) Recruit a giant multidisciplinary team of awesome people
3) Have them determine which phrases are used and which claims are made, based on guidance.
* narrator: this is not easy, but at least it's fun!**
** narrator: fun-ish.
First thing that needs to be done is developing a rough protocol.
My task of the day is making a messy terrible outline of what I think this should look like.
Good news is that I've had enough proposals in this arena that I can stitch it together from scraps left over from ~4 years of failed grant proposals and cancelled projects, so this shouldn't be that hard, right?
Some time this afternoon, I'll open a blank document for the protocol draft, and share the link with the world, so you can watch / comment / suggest everything as I go, and can see just how terrible I am at writing.
And as a bonus from now to about 5ish, I'm gonna stream it all on Twitch, come join!:
Day 1: Got a decent outline of the protocol draft. Goal is to get a full shitty draft for tomorrow, and send around for potential protocol co-authors / revisions.
I want a full, presentable protocol by the end of the month, because this is going to be an aggressive timeline.
Bad protocol chalkboard draft #1 written and done.
Now in get-core-team-together mode, to be followed shortly by absolutely-massacre-original-draft-and-like-a-phoenix-a-decent-draft-will-be-reborn-from-the-ashes mode
Status update: protocol is getting close to done, protocol coauthor team finalized, reviewers are being recruited and we're having one of several intro meetings tomorrow morning.
Good thing I definitely for sure planned ahead and made slides.
Only thing that's not public is stuff that contains personal information; everything else is public.
Welp, things are going. Here's where we're at:
- Team recruited and on Slack
- Currently putting the final touches on the protocol
- Wrote/ran the search code
- Team divided into screeners and review tool piloters
- Meetings scheduled for the training sessions
This is definitely a work weekend for me, lots of moving parts and administration for which I am the bottleneck.
Good news is that the team is AWESOME. Particular shout out to @SarahWieten for taking a whole bunch of responsibility (including boring stuff).
Relatedly, we had so much interest in this study that we had to narrow a list of 150+ people down to a 50ish person final team.
Decisions were based on a lot of things, but notably maximally diverse representation among qualified people.
Which is to say we had to say no to a whole lot of people who are super awesome and super qualified.
If I had known how much interest there would be, possible I could have redesigned things to work with a bigger team. But alas.
Hard deadline looms though. We've used up just about the entire buffer already (granted, planning is the "high risk of delays" stage).
Doing a first-of-its-kind project with a massive team and lots of unknown unknowns is a particularly Noah style of bad idea.
Protocol pre-registered, screening process and review tool piloting start roughly simultaneously tomorrow.
Feels like things are a touch more rushed than I would like, but so it goes. Good news: pre-registration is not a stone tablet. If we need to make changes, we'll make them.
For whatever reason, the screening is always always always the most chaotic part of these projects.
Hiccups abounded, but screening is well underway (albeit a touch behind schedule due to said hiccups).
Main review training starts on Monday!
One hiccup was just a straight up coding error that was my fault, but others were more about the sampling and screening design due to some unexpected interactions. Lessons learned.
Pretty much inevitable with a first-of-its-kind sort of project, but can be frustrating.
While the screening's been going on, @SarahWieten has been leading a team to pilot the review tool and giving really incredibly helpful suggestions.
The many-commenters model is a lot of work for sure, but it absolutely makes a HUGE difference to the end product.
Really really looking forward to the main review phase starting (after the inevitable round of fires have been put out, of course).
I've been going nonstop on this project for a few weeks now. Will be nice to take a break.
Inching ever closer to launching the main review phase, currently desperately putting the final touches on a dozen things before we commit.
Side note: I think I've worked harder on this over the last few weeks than I've worked on just about anything.
A brief recap of the last 2 weeks:
Estimated person-time for the main review alone is just a hair over 1,000 person hours between ~50 coauthor reviewers.
That's not even counting the screening and piloting process, design, admin, analysis/writing, etc.
This thing is a MONSTER.
AND WE'RE OFF! Data collection has officially started for the main review.
I've been working on getting to this moment for YEARS and it's awesome to see it happening
Progress is happening
Primary review phase wraps up (ish) today! Next week is the arbitration review phase, plus a bit of extra ratings and such.
But the end of the data collection phase is in sight.
One side effect of this study is that a lot of extremely smart people are seeing what a reasonably representative random sample of the high-impact medical / epi journal literature actually looks like.
Reactions have been pretty interesting.
By request, I am doing an improvised stream of how the back end of all this works on Thursday, July 22 at 10am eastern.
How do you organize the code and interface of a complex, multi-phase, 50+ person, 1k+ article, 3,000+ review study?
These mega collabo projects can be monstrous, but good golly it's magical sometimes.
I was short on time to write, so I sent a quick message to the group to see if someone could handle the intro, and BOOM @dingding_peng wrote an awesome 1st draft, WAY better than I would have.
100% of article reviews completed!
Still so, so much left to do, but this is the point at which we officially have enough data to meet our primary analysis goals.
Going to reflect on a few things about getting here.
Firstly, the screening part turned out to be the most chaotic phase, and the main review went mostly fairly smoothly.
Screening is the point where you have a logistically hard problem, the least info, and untuned systems.
It was EXTRA chaotic due to the requirement of accepting the same number of articles per journal as a stopping rule, with wildly different acceptance rates per journal, and with feedback loops for screener assignments.
Doing that involved a lot of pain and chaos. Do not recommend.
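For a rough feel of why the same-number-of-articles-per-journal requirement gets messy, here's a minimal sketch. This is not the study's actual code: the function name `screen_until_quota`, the journal names, the acceptance rates, and the quota are all hypothetical, and it ignores the screener-assignment feedback loops entirely.

```python
import random

def screen_until_quota(accept_prob, quota, seed=0):
    """Screen articles one at a time per journal until `quota` are accepted.

    accept_prob: hypothetical dict of journal -> chance a screened article is accepted.
    Returns a dict of journal -> number of articles screened to reach the quota.
    """
    rng = random.Random(seed)
    screened = {}
    for journal, p in accept_prob.items():
        accepted = 0
        n_screened = 0
        while accepted < quota:  # the stopping rule: same quota for every journal
            n_screened += 1
            if rng.random() < p:
                accepted += 1
        screened[journal] = n_screened
    return screened

# Hypothetical acceptance rates, purely for illustration.
rates = {"Journal A": 0.80, "Journal B": 0.05}
workload = screen_until_quota(rates, quota=20)
for journal, n in workload.items():
    print(journal, "needed", n, "screens to reach 20 accepted articles")
```

Even in this toy version, the low-acceptance journal eats many times more screening effort for the same quota, and you don't know the acceptance rates until you're already screening.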
I also messed up and created some extra work due to a very stupid code bug that resulted in excluding two very important journals, which was not caught until late in the process.
Fortunately, the system was built such that fixing it wasn't a huge problem. But still.
Then there's just the general chaos of doing a complicated and way-out-of-the-ordinary project with very unusual framing and methods, which requires constant tweaking and changes, etc.
Doing something weird is always tough.
And then there's the fact that this project involves carefully coordinating, training, and synchronizing 50 (!!!) people, where everything needs to mesh at precise times and multiple phases, and any one unmeshing issue throws the whole thing out of whack.
As before, the only thing that isn't public is personal info, so I can't and won't talk about specifics.
But some tough situations arose, some unavoidable, others perhaps avoidable.
By and large though, the crew is/was ASTOUNDINGLY amazing, and my favorite part of these things.
Now we're on the cleanup phase, where there is a tough balance to be had. I have to maintain three conflicting goals:
1) Data quality
2) Being a reasonably neutral party to avoid over-influencing reviewer decisions
3) Timelines
Can't get all 3 perfectly.
Have a bit more data collection to do, but the next phases are the analysis and manuscript writing phases.
And because I am me, I am going to do this the hard way, with hypertransparency engaged.
That means everyone can see all the not-so-pretty parts of the sausage making.
For a sense of scale, what you see in that chart was the work of 49 people across the world, carefully synced and coordinated, with a complex multiphase process, using a first of its kind guidance and review...
In *42 days* from first screen to last data collected.
I am looking forward to never working this hard ever again.
But no rest yet.
Because I have 28 days left of my fellowship to get this written and submitted.
The results section is being written, figures and statistics are being dropped, come check it out!
The "big" result and data are being dropped and written right now.
To what degree does the strength of causal implications in the sentence linking exposure to outcome match the causal implication of action recommendations (i.e. what the authors say you should *do* with the data)?
Nearly done writing up a first draft of the results.
Also, just drafted a nearly 2 page document detailing changes from the original protocol, of which there were many.
Doing something new and weird means running into unexpected weird problems, and plans change.
Big one was that we ended up using a much more direct and context-sensitive measure of linking language causal strength, scrapping the original (over-complex and probably worse in every way) assignment and rating process.
Preregistration is SUPER useful, but not a stone tablet.
Aaaaand first (bad) draft of the results section is written. On to the discussion section this week.
And boy howdy what a discussion section it's gonna be. I tend to think the results are pretty damning (including in some ways that surprised me).
Now first bad draft of the Discussion!
I expect most of this to get rewritten a few times over, but the first bad draft is the hardest part.
Entering the phase where 90% of the paper is done.
I always find this stage of a paper to be tough. We know what the results are and what we want to say. The big stuff is done; we're 95% of the way there.
But there are a thousand small tasks that make up the other 95%.
To make a woodworking analogy: all parts are built and more or less assembled.
Everything else from here is sanding, finishing, and getting it installed.
There's just so, so much sanding.
One REALLY tough thing in this paper is just how much tiptoeing we have to do for internal consistency in how we describe things.
In our case, we can't merely "just use the right words," we have to make DAMN sure that we also don't make any possible inappropriate implication.
Folks often say that DAGs make our causal inference assumptions explicit. But that's only kinda true
The biggest assumptions in a DAG aren't actually IN the DAG; they're in what we assume ISN'T in the DAG. It's all the stuff that's hidden in the white space.
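A tiny sketch of the white space idea, under stated assumptions: the toy DAG below (nodes `C`, `X`, `M`, `Y`) and the helper name `missing_edges` are hypothetical, made up just to show that every pair of variables without an arrow is itself an assumption.

```python
from itertools import combinations

def missing_edges(nodes, edges):
    """Return the node pairs with no arrow in either direction.

    Each missing pair is an implicit assumption of "no direct effect" between the two.
    """
    drawn = {frozenset(edge) for edge in edges}
    return [pair for pair in combinations(nodes, 2) if frozenset(pair) not in drawn]

# Hypothetical toy DAG: C confounds exposure X and outcome Y; M sits on the X -> Y path.
nodes = ["C", "X", "M", "Y"]
edges = [("C", "X"), ("C", "Y"), ("X", "M"), ("M", "Y")]

assumed_absent = missing_edges(nodes, edges)
print(assumed_absent)
```

For this toy DAG, that lists the C-M and X-Y pairs: two "no direct effect" assumptions nobody drew. And this only covers variables that made it into the diagram; anything left out entirely can't be enumerated at all.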
Time to make it official: short of some unbelievably unlikely circumstances, my academic career is over.
I have officially quit/failed/torpedoed/given up hope on/been failed by the academic system and a career within it.
To be honest, I am angry about it, and have been for years. Enough so that I took a moonshot a few years ago to do something different that might change things or fail trying, publicly.
I could afford to fail since I have unusually awesome outside options.
And here we are.
Who knows what combination of things did me in: incredibly unlucky timing, not fitting in boxes, less "productivity," lack of talent, etc.
In the end, I was rejected from 100% of my TT job and major grant applications.
Always had support from people, but not institutions.
Ever wondered what words are commonly used to link exposures and outcomes in health/med/epi studies? How strongly language implies causality? How strongly studies hint at causality in other ways?
Health/med/epi studies commonly avoid using "causal" language for non-RCTs to link exposures and outcomes, under the assumption that "non-causal" language is more "careful."
But this gets murky, particularly if we want to inform causal q's but use "non-causal" language.
To find answers, we did a kinda bonkers thing:
As if that wasn't enough, we also tried to push the boundaries on open science, in hyper transparency and public engagement mode.
Granted, we only see the ones that get caught, so "better" frauds are harder to see.
But I think people don't appreciate just how hard it is to make simulated data that don't have an obvious tell, usually because something is "too clean" (e.g. the uniform distribution here).
At some point, it's just easier to actually collect the data for real.
The ones that I think are going to be particularly hard to catch are the ones that are *mostly* real but fudged a little haphazardly.
Perpetual reminder: cases going up when there are NPIs (e.g. stay at home orders) in place generally does not tell us much about the impact of the NPIs.
Lots of folks out there making claims based on reading tea leaves from this kind of data and shallow analysis; be careful.
What we want to know is what would have happened if the NPIs were not there. That's EXTREMELY tricky.
How tricky? Well, we would usually expect cases/hospitalizations/deaths to have an upward trajectory *even when the NPIs are extremely effective at preventing those outcomes.*
The interplay of timing, infectious disease dynamics, social changes, data, etc. makes it really really difficult to isolate what the NPIs are doing alongside the myriad of other stuff that is happening.
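Here's a deliberately oversimplified sketch of the "rising cases don't mean the NPI failed" point. Everything in it is a hypothetical assumption for illustration: plain exponential growth, a starting count of 100, and an NPI that halves the daily growth rate. Real dynamics (lags, behavior change, data quirks) are much messier, which is the whole problem.

```python
def simulate_cases(start, days, daily_growth):
    """Toy exponential growth: each day's cases = previous day's * (1 + daily_growth)."""
    cases = [start]
    for _ in range(days):
        cases.append(cases[-1] * (1 + daily_growth))
    return cases

# Hypothetical numbers, purely for illustration.
without_npi = simulate_cases(start=100, days=14, daily_growth=0.20)  # counterfactual world
with_npi = simulate_cases(start=100, days=14, daily_growth=0.10)     # NPI halves the growth rate

# Cases rise every single day even though the NPI is doing a lot.
still_rising = all(later > earlier for earlier, later in zip(with_npi, with_npi[1:]))
averted_share = 1 - with_npi[-1] / without_npi[-1]

print(f"Cases still rising under the NPI: {still_rising}")
print(f"Share of day-14 cases averted vs. the counterfactual: {averted_share:.0%}")
```

The observed curve (with the NPI) goes up the whole time, yet in this toy setup a large share of day-14 cases is averted relative to the counterfactual, which we never get to observe directly.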
The resistance to teaching regression discontinuity as a standard method in epi continues to be baffling.
I can't think of a field for which RDD is a more obviously good fit than epi/medicine.
It's honestly a MUCH better fit for epi and medicine than econ, since healthcare and medicine are just absolutely crawling with arbitrary threshold-based decision metrics.
(psssssst to epi departments: if you want this capability natively for your students and postdocs - and you absolutely do - you should probably hire people with cross-disciplinary training to support it)
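For anyone who hasn't seen it, a bare-bones sharp regression discontinuity looks something like the sketch below. It is a toy under stated assumptions: the data are made up and noiseless (a hypothetical lab value with a treatment threshold at 50 and a built-in jump of -2), the helper names `fit_line` and `sharp_rdd` are hypothetical, and a real analysis would need bandwidth selection, uncertainty estimates, and checks for manipulation around the cutoff.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sharp_rdd(xs, ys, cutoff, bandwidth):
    """Fit a line on each side of the cutoff and return the gap between them at the cutoff."""
    below = [(x, y) for x, y in zip(xs, ys) if cutoff - bandwidth <= x < cutoff]
    above = [(x, y) for x, y in zip(xs, ys) if cutoff <= x <= cutoff + bandwidth]
    a0, b0 = fit_line(*zip(*below))
    a1, b1 = fit_line(*zip(*above))
    return (a1 + b1 * cutoff) - (a0 + b0 * cutoff)

# Hypothetical data: treatment is given when the lab value is >= 50 and lowers the outcome by 2.
xs = list(range(30, 71))
ys = [5 + 0.1 * x - (2.0 if x >= 50 else 0.0) for x in xs]

estimate = sharp_rdd(xs, ys, cutoff=50, bandwidth=10)
print(f"Estimated jump at the threshold: {estimate:.2f}")
```

The design choice worth noticing is that only observations near the threshold are used, on the idea that people just above and just below an arbitrary cutoff are otherwise similar.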