Stas Kolenikov @StatStas
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata Well, in this case (as with many other documents written by statisticians who assume that every researcher knows enough survey statistics to connect the dots), the documentation does not explain how to use the complex survey weights. It just says, "weights should always be used".
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata A clear specification should be:
- in Stata, this is your -svyset-
- in SAS, this is your PROC SURVEY* setup with the WEIGHT, CLUSTER and STRATA statements
- in R, here's your svydesign
so that researchers could pick this up and drop it into their analyses.
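A minimal sketch of what such a drop-in specification could look like, written as a small Python helper that prints the three statements. The variable names `psuid`, `stratum` and `finalwt`, the dataset name `mydata`, and the helper itself are hypothetical placeholders, not taken from any of the studies discussed. The generated text follows the usual `svyset`, `PROC SURVEYMEANS` and `survey::svydesign` forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyDesign:
    """Names of the design variables (hypothetical placeholders)."""
    weight: str
    cluster: str
    strata: str


def stata_spec(d: SurveyDesign) -> str:
    # Stata: declare the design once, then prefix estimation commands with svy:
    return f"svyset {d.cluster} [pweight={d.weight}], strata({d.strata})"


def sas_spec(d: SurveyDesign, data: str = "mydata") -> str:
    # SAS: the SURVEY procedures share the same three design statements.
    return (
        f"proc surveymeans data={data};\n"
        f"  weight {d.weight};\n"
        f"  cluster {d.cluster};\n"
        f"  strata {d.strata};\n"
        f"run;"
    )


def r_spec(d: SurveyDesign, data: str = "mydata") -> str:
    # R: svydesign() from the survey package takes one-sided formulas.
    return (
        f"svydesign(ids = ~{d.cluster}, strata = ~{d.strata}, "
        f"weights = ~{d.weight}, data = {data})"
    )


design = SurveyDesign(weight="finalwt", cluster="psuid", strata="stratum")
for spec in (stata_spec(design), sas_spec(design), r_spec(design)):
    print(spec)
    print()
```

The point is only that the documentation would name the three variables once and hand users the finished statement for their package.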
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata There will be cases when you would also need to say something like,
For household analyses, the specification is [BLAH]
For analysis of adults, the specification is [BLAH]
For analysis of children, the specification is [BLAH]
For analysis of urine samples, ...
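One way to make this explicit is a lookup from the unit of analysis to the design variables. In the sketch below every variable name (`hhwt`, `adultwt`, `childwt`, `psuid`, `stratum`) is hypothetical and stands in for the [BLAH] placeholders above. Only the Stata `svyset` syntax is real.

```python
# Hypothetical design variables per unit of analysis; a real study would
# substitute the names given in its own documentation.
DESIGN_BY_LEVEL = {
    "household": {"weight": "hhwt", "cluster": "psuid", "strata": "stratum"},
    "adult": {"weight": "adultwt", "cluster": "psuid", "strata": "stratum"},
    "child": {"weight": "childwt", "cluster": "psuid", "strata": "stratum"},
}


def svyset_for(level: str) -> str:
    """Return the Stata svyset line for the requested analysis level."""
    try:
        d = DESIGN_BY_LEVEL[level]
    except KeyError:
        # Fail loudly rather than silently falling back to another weight.
        raise ValueError(f"no survey specification documented for {level!r}")
    return f"svyset {d['cluster']} [pweight={d['weight']}], strata({d['strata']})"


for level in DESIGN_BY_LEVEL:
    print(f"{level}: {svyset_for(level)}")
```

Asking for a level that the documentation does not cover raises an error, which is the behavior you want when the right weight is not obvious.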
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata The India documentation is written for SAS, which is the only remaining software that really requires the data to be sorted before merging. Both R and Stata handle these minutiae on their end.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata Also, my guess is that PSUID is... well, the ID of the PSU, and must be specified as a cluster variable in complex survey analysis.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata As an instructor of classes on complex survey data analysis using Stata, I have been giving my students examples of well-written documentation that connects the dots, and of less clear documentation.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata The less clear documentation would call PSUs "pseudo-replicates" or "variance estimation units" or some other technical lingo. This is a signal from one survey statistician (sample designer) to another (data user) that these aren't the true strata and PSUs.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata But it is a very minor technical point compared to everything else the end-user researcher would really have to care about in this study... dealing with missing values would be a more important point, for instance.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata This is all to say that my workflow of figuring out how to specify the survey settings in a new data set that I have to deal with is the following.
1. Search documentation for "svyset" as a keyword.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata 2. If that fails, search for "sampling weight", "final weight", "analysis weight" or "design weight" (because "weight" per se may produce just way too many false positives).
3. See if there is any description of strata and clusters near the text where weights are mentioned.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata 4. Search for "PSU" and "cluster" and "strata" and "stratification" and see if I can find the variables that I need to specify.
5. Search for "replicate weights", "BRR", "jackknife" and "bootstrap".
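The search order above can be sketched as a small script. This is only an illustration of the workflow, assuming the documentation has already been extracted to plain text. The function name, the tier labels and the sample text are made up, and step 3 (reading the text around each weight mention) is still left to the human.

```python
import re

# Search tiers in the order described above. A bare "weight" is left out
# on purpose because it produces too many false positives.
SEARCH_TIERS = [
    ("svyset", ["svyset"]),
    ("weights", ["sampling weight", "final weight",
                 "analysis weight", "design weight"]),
    ("design variables", ["PSU", "cluster", "strata", "stratification"]),
    ("replicate methods", ["replicate weights", "BRR",
                           "jackknife", "bootstrap"]),
]


def find_survey_hints(text: str) -> dict[str, list[str]]:
    """Map each tier to the lines of `text` that mention one of its keywords."""
    hits: dict[str, list[str]] = {}
    lines = text.splitlines()
    for tier, keywords in SEARCH_TIERS:
        # Case-insensitive substring match on any keyword in the tier.
        pattern = re.compile(
            "|".join(re.escape(k) for k in keywords), re.IGNORECASE
        )
        matched = [ln.strip() for ln in lines if pattern.search(ln)]
        if matched:
            hits[tier] = matched
    return hits


sample_doc = """\
The final weight WT should always be used.
PSUID identifies the primary sampling unit.
Respondents reported their weight in kilograms.
"""

for tier, found in find_survey_hints(sample_doc).items():
    print(tier, "->", found)
```

In the made-up sample, the line about body weight in kilograms is skipped, while the "final weight" and "PSUID" lines are reported under their tiers.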
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata In the tobacco data set you linked, there is a reasonably clear description that the weight/cluster/strata/replicate weight variables are in a standalone file. As @bradytwest's paper shows, separating them out creates a danger that researchers never use them.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata On the other hand, you can clearly expect that whatever is in that dataset will be sufficient to specify the survey settings for any analysis.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata The India documentation insists on using the weights, but my additional search for PSU brings up "PSUID" on the webpage, and furthermore, in the PDF file, you can find that the stratum is given by the `id1` and `hi1` variables (do they agree??).
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata Again, I come to this data set with the knowledge of survey statistics. Other users would not have it (as I would not have their knowledge of reproductive behaviors or addictions or whatever substantive stuff they are interested in), and would stop in their tracks much earlier.
@MCLevenstein @MaryELosch @ICPSR @bradytwest @NAHDAP1 @DSDRdata My apologies for a very long tweetstorm. tl;dr -- your documentation standard may want to include a clear requirement of having the survey variables isolated and formed into Stata, SAS and R statements. An example of exactly what I want is ANES -- icpsr.umich.edu/icpsrweb/ICPSR…