Quality of the #2020Census data -- a session on race and ethnicity coding.
Census follows the OMB 1997 standards, it does not set them: 2 categories of ethnicity, 5 categories of race (American Indian/Alaska Native, Asian, Black/African American, Native Hawaiian/Other Pacific Islander, White) + "Some Other Race"
The race and ethnicity questions capture how people self-identify, so research into them is necessary to understand how the U.S. population evolves (more multiracial and more diverse than measured in the past)
There were some proposals to start offering a "Middle Eastern / North African" (MENA) category, but they did not make it into the #2020Census.
4 categories of Hispanic:
- Mexican/Mexican Am/Chicano
- Puerto Rican
- Cuban
- Yes, some other Hispanic -- with a write-in line where other countries of origin are listed as examples
Race questions: White and Black had "Print" write-in lines for detailed origins (a German White or a Nigerian Black, say); American Indian / Alaska Native respondents could write in their enrolled or principal tribes; multiple Asian checkboxes covered ~10 countries of origin, plus Other Asian and Other Pacific Islander
Added: new White detailed groups, new Black detailed groups, and many AIAN/Pacific Islander groups. About 99% of write-ins were coded automatically, with clerical review for the residual. #2020Census: 350.5M write-ins, cf. 54.7M in the 2010 Census
Up to six write-in responses and up to 200 characters per write-in line, with no prioritization (cf. two write-ins, 30 characters, and priority given to Hispanic origin in the 2010 Census)
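Not the Bureau's actual coder, of course, but a minimal sketch of what lookup-based automated coding of write-ins could look like; the code list, the codes, and the auto_code helper are all invented for illustration.

```python
# Toy illustration of automated write-in coding (NOT the Census Bureau's
# system): normalize the string, look it up in a reference code list, and
# send anything unmatched to the clerical-review residual. Codes are made up.
CODE_LIST = {"german": 1040, "nigerian": 2300, "samoan": 5200}

def auto_code(write_in):
    """Return a detailed-origin code, or None for the clerical-review residual."""
    return CODE_LIST.get(write_in.strip().lower())

assert auto_code(" German ") == 1040
assert auto_code("purple") is None  # unmatched write-ins go to clerical review
```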
(I think I wrote in "Beige" as suggested by my American friends who said there is no way I should refer to myself as "White".)
(And the most meaningful "race" question I have seen came from a non-government organization: "White", "Black", "Asian", "Hispanic", "Native American", "Immigrant". I could nail that one without hesitation!)
2020 ethnicity breakdown: 62.1M Hispanic or Latino, 269.4M not Hispanic
"Some Other Race" (49.9M) surpassed Black or African American (46.9M) as the second-largest group after White.
The race/ethnicity results should be compared between 2010 and 2020 with caution due to these changes in measurement. [The 2020 methods were much improved for the snapshot measurement, but for trends, of course, the changes are bad -- S.K., not Census]
Q from the floor: the mission of the Census is to count every person once, only once, and in the right place. What about counting race exactly? [a trick question indeed]
A: we were not surprised by the #2020Census results given what we saw in the 2015 National Content Test (and that whole decade of testing). Back in 2010, we saw good agreement with demographic projections. That's the best we can say.
Q: @dcvanriper asked for clarification regarding how "Cuban + Thai + Filipino" ended up as "Some Other Race + Asian".
A: all responses in the race write-in space were mapped to a race [and apparently did not get matched to any major category group -- S.K.]
Q: while you created a wealth of information on detailed race categories, people on another floor in your building are working hard to obfuscate that.
Q [an opinion, really]: the improvements in race/ethnicity measurement did not go far enough because of the political appointees at OMB
As a general comment, Zoom organizers should pay exponentially more attention to the chat. Neither the camera nor the mic works on the computer I am connecting from, so I am limited to the damn chat -- @uscensusbureau, looking at you.
The 2020 #ACSdata used the same question format and the coding scheme/algorithms as the #2020Census
Connie Citro observed that vital records do not have the detailed coding used in the Census/ACS. Vital records feed the population projections, which are in turn used as controls for #ACSdata, which is in turn used to weight nearly all general-population surveys. What's going on?
By how much do response rates differ between different population groups? Today, we will explore this with the @BRFSS data (cdc.gov/brfss/annual_d…) 🧵👇 1/11
The data set makes it possible to study these differentials, as it is one of the relatively rare data sets with both the design weights and the calibrated weights... in our case, _WT2RAKE and _LLCPWT. 2/11
(It's not me who likes to YELL, it is the CDC. If you don't like the variable names, janitor them the way you do like.) 3/11
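A minimal sketch of the idea, assuming a locally downloaded 2020 LLCP transport file (the file name is a hypothetical local copy): the ratio of the calibrated weight to the design weight shows how much each group had to be adjusted upward in raking, a rough proxy for differential response. _IMPRACE is BRFSS's imputed race/ethnicity variable.

```python
import pandas as pd

# Hypothetical local copy of the BRFSS 2020 LLCP SAS transport file.
brfss = pd.read_sas("LLCP2020.XPT")
brfss.columns = brfss.columns.str.lower()        # "janitor" the YELLING names

# Calibrated weight over design weight: the size of the raking adjustment,
# a rough proxy for differential nonresponse across groups.
brfss["adjustment"] = brfss["_llcpwt"] / brfss["_wt2rake"]
print(brfss.groupby("_imprace")["adjustment"].mean())  # _IMPRACE = imputed race
```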
#JSM2021 panel led by @minebocek on upskilling as a statistician -- how do we learn?
#JSM2021 @minebocek @hglanz: no shortage of stuff to learn. First identify what you don't know -- that comes from modern media (blogs, Twitter, podcasts), groups and communities (@RLadiesGlobal or its local chapters), and professional organizations (@amstatnews).
#JSM2021 an exceptionally rare case of ACTUAL out-of-sample prediction in #MachineLearning #ML #AI: two rounds of the same health data collection by @CDCgov
#JSM2021 Yulei He @CDCgov: RANDS 1 (fall 2015) + RANDS 2 (spring 2016). Build models on RANDS 1 and compare the predictions against RANDS 2.
#JSM2021 Yulei He: R-squared about 30%; random forests and gradient boosting reduce the prediction error by about 4%, shrinking toward the mean; standard errors are way too small (about 50% smaller than they should be)
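The setup is easy to emulate; a hedged sketch with synthetic data standing in for RANDS (which I obviously do not have on my laptop): fit on round 1, score on a genuinely unseen round 2.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

def make_round(n):
    """One draw from the same data-generating process, standing in for a RANDS round."""
    X = rng.normal(size=(n, 8))
    y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=1.5, size=n)
    return X, y

X1, y1 = make_round(2_000)   # "RANDS 1": the training round
X2, y2 = make_round(2_000)   # "RANDS 2": the genuinely out-of-sample round

model = GradientBoostingRegressor(random_state=0).fit(X1, y1)
print(f"out-of-sample R^2 on round 2: {r2_score(y2, model.predict(X2)):.2f}")
```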
1. When will the survey statisticians in the U.S. move from weird variance estimation methods (the grouped jackknife) to simple and straightforward ones (the bootstrap)?
and
2. When will they move from weird imputation methods with limited dimensionality and limited ability to assess the implicit model fit (the hotdeck) to those where you explicitly model the outcome and understand which variables matter for it (imputation by chained equations, ICE)?
Oh, and somebody reminded me of
3. When will we move from PROC STEPWISE to the lasso, as the rest of the statistics world has? (A toy sketch of all three follows below.)
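None of these require heroics. A toy, self-contained sketch of the three wishes; all data, weights, and names are invented for illustration, not taken from any agency system.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LassoCV

rng = np.random.default_rng(42)
n, p = 1_000, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=n)   # only two predictors matter
w = rng.uniform(0.5, 2.0, size=n)                # made-up survey weights

# 1. Plain nonparametric bootstrap of a weighted mean (for a real complex
#    survey you would resample PSUs within strata, not individual cases).
boots = [np.average(y[idx], weights=w[idx])
         for idx in (rng.integers(0, n, size=n) for _ in range(500))]
print(f"bootstrap SE of the weighted mean: {np.std(boots, ddof=1):.4f}")

# 2. Imputation by chained equations: each variable is modeled conditionally
#    on the others, so the implicit model is explicit and inspectable.
X[rng.random(size=X.shape) < 0.10] = np.nan      # knock out 10% of X
X_imp = IterativeImputer(estimator=BayesianRidge(), max_iter=10,
                         random_state=0).fit_transform(X)

# 3. Lasso with a cross-validated penalty in place of PROC STEPWISE.
lasso = LassoCV(cv=5, random_state=0).fit(X_imp, y)
print(f"lasso kept columns {np.flatnonzero(lasso.coef_)}, alpha={lasso.alpha_:.3f}")
```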
#JSM2021 @jameswagner254: Using Machine Learning and Statistical Models to Predict Survey Costs -- a presentation on several attempts to integrate cost models into responsive design systems
#JSM2021 @jameswagner254: Responsive designs operate on indicators of errors and costs. Error indicators: the R-indicator, balance indicators, FMI, sensitivity to ignorability assumptions (the @bradytwest @Rodjlittle Andridge papers).
Some decisions are made at the sample level (launch a new replicate, switch to a new phase of the follow-up protocol), others at the case level (change the incentive amount, change the mode)
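The R-indicator mentioned above is simple to compute once you have estimated response propensities: R = 1 - 2*S(rho-hat), where S is the standard deviation of the propensities (Schouten, Cobben & Bethlehem 2009). A minimal sketch with made-up frame data and a logistic propensity model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
Z = rng.normal(size=(2_000, 3))          # frame/paradata covariates (made up)
true_p = 1 / (1 + np.exp(-(0.2 + 0.8 * Z[:, 0])))
responded = rng.random(2_000) < true_p   # simulated response outcomes

# Estimate response propensities from covariates observed for everyone, then
# R = 1 - 2 * S(rho_hat); R = 1 means a perfectly balanced response.
rho_hat = LogisticRegression().fit(Z, responded).predict_proba(Z)[:, 1]
print(f"R-indicator: {1 - 2 * rho_hat.std(ddof=1):.3f}")
```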
Now let's see how @olson_km is going to live-tweet while giving her own #JSM2021 talk
#JSM2021 @olson_km: Decisions in survey design are questions of survey errors and questions of survey costs. Cost studies are hard: it is difficult to introduce experimental variation of design features, with the possible exception of incentives. Observational examinations are more typical.
#JSM2021 @olson_km: When you have one (repeated) survey at a time, you can better study the impacts of the design features that vary (but that cannot provide a basis for the features that do not vary).