#rstats backbone infrastructure of library(ipumsr)
@ipums ipumsr relies on the @DDIAlliance DDI codebook metadata format for approximately everything
Some internal structure of the DDI codebook objects in ipumsr
Variable names, variable labels, and value labels are all available
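A minimal sketch of pulling that metadata out, assuming an extract whose DDI file is named cps_00001.xml (the file name and the MONTH column are just placeholders):

```r
library(ipumsr)

ddi <- read_ipums_ddi("cps_00001.xml")  # parse the DDI codebook
cps <- read_ipums_micro(ddi)            # read the microdata with labels attached

ipums_var_info(ddi)           # one row per variable: name, label, value labels
ipums_val_labels(cps$MONTH)   # value labels of a single labelled column
```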
Your daily reminder that #rstats factor variables suck big time compared to Stata and SAS
I am not crying, you are crying
Helper functions: replace missing values. These suck in #rstats compared to Stata and SAS, too: there you can have extended missing values .a through .z, and each can be labelled with the reason it is missing (not in population, refused, not applicable, etc.)
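For illustration, a small sketch with ipumsr's lbl_na_if() on a made-up labelled vector (the codes and labels are invented):

```r
library(ipumsr)
library(haven)

x <- labelled(
  c(1, 2, 97, 98, 1),
  c("Yes" = 1, "No" = 2, "Refused" = 97, "Not in universe" = 98)
)

# Turn the missing-ish codes into plain NA -- the reason is lost,
# unlike Stata/SAS extended missing values
lbl_na_if(x, ~ .val >= 97)

# Or select them by label text instead of value
lbl_na_if(x, ~ .lbl %in% c("Refused", "Not in universe"))
```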
Another helper: recode and label new values. Very helpful for hierarchical classification codes, where you can divide by 10 or 100 to obtain the higher-level, fewer-digit code. The label is copied from the lowest-numbered original category, so it would need extra work down the line.
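That helper is presumably lbl_collapse(); a sketch with invented 3-digit codes collapsed to their 2-digit parents by integer division:

```r
library(ipumsr)
library(haven)

occ <- labelled(
  c(111, 112, 121, 211),
  c("Crop growers" = 111, "Animal breeders" = 112,
    "Forestry" = 121, "Mining" = 211)
)

# .val %/% 10 maps 111 and 112 to 11, 121 to 12, 211 to 21; each collapsed
# group inherits the label of its lowest original code ("Crop growers" for 11)
lbl_collapse(occ, ~ .val %/% 10)
```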
A more generic function for that is lbl_relabel().
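A sketch of lbl_relabel(), again on invented codes -- here you state both the target value and the label you actually want:

```r
library(ipumsr)
library(haven)

ind <- labelled(
  c(111, 112, 121),
  c("Crop growers" = 111, "Animal breeders" = 112, "Forestry" = 121)
)

lbl_relabel(
  ind,
  lbl(11, "Agriculture")        ~ .val %in% c(111, 112),
  lbl(12, "Forestry & logging") ~ .val == 121
)
```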
Now let’s talk about @ipums @nhgis geographic data - tied to library(sf)
There are helper functions for managing the geographic data
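A hedged sketch of the usual NHGIS workflow with those helpers; the extract file names are placeholders:

```r
library(ipumsr)
library(sf)

nhgis_data  <- read_nhgis("nhgis0001_csv.zip")        # tabular extract
nhgis_shape <- read_ipums_sf("nhgis0001_shape.zip")   # boundaries as an sf object

# NHGIS tables and shapefiles share the GISJOIN key
nhgis <- ipums_shape_inner_join(nhgis_data, nhgis_shape, by = "GISJOIN")

plot(st_geometry(nhgis))
```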
Coming soon - @popdatatech API - ask to beta test. Sharing extract definitions is important for reproducibility!
By how much do response rates differ between different population groups? Today, we will explore this with the @BRFSS data (cdc.gov/brfss/annual_d…) 🧵👇 1/11
The data set makes it possible to study these differentials, as it is one of the relatively rare data sets with both the design weights and the calibrated weights... in our case, _WT2RAKE and _LLCPWT. 2/11
(It's not me who likes to YELL, it is CDC. If you don't like the variable names, janitor them the way you like) 3/11
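A rough sketch of one way to eyeball the differential: compare the raked weight to the design weight within respondent groups -- a large adjustment ratio suggests the group responded at a lower rate. The file name and the _AGEG5YR grouping variable are placeholders; check the codebook for your BRFSS year.

```r
library(haven)   # BRFSS ships as a SAS transport (.XPT) file
library(dplyr)

brfss <- read_xpt("LLCP2019.XPT")

brfss %>%
  group_by(`_AGEG5YR`) %>%   # any respondent characteristic would work here
  summarise(
    mean_design     = mean(`_WT2RAKE`, na.rm = TRUE),
    mean_calibrated = mean(`_LLCPWT`,  na.rm = TRUE),
    adjustment      = mean_calibrated / mean_design
  ) %>%
  arrange(desc(adjustment))
```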
Quality of the #2020Census data -- a session on race and ethnicity coding.
Census follows the OMB 1997 standards, it does not set them: 2 categories of ethnicity, 5 categories for race (AIAN, Asian, Black/AA, Native Hawaiian/Pacific Islander, White) + "some other race"
The race and ethnicity questions help understand how people self-identify, so research into these is necessary to understand how the U.S. population evolves (more multiracial, more diverse than measured in the past)
There were some proposals to start offering "Middle Eastern / North African" (MENA), but they did not make it to the #2020Census.
#JSM2021 panel led by @minebocek on upskilling for a statistician -- how to learn??
@minebocek #JSM2021 @hglanz no shortage of stuff to learn. First identify what you don't know -- that comes from modern media (blogs, Twitter, podcasts; groups and communities -- @RLadiesGlobal or local chapters; professional organizations -- @amstatnews).
#JSM2021 an exceptionally rare case of ACTUAL out-of-sample prediction in #MachineLearning #ML #AI: two rounds of the same health data collection by @CDCgov
@CDCgov #JSM2021 Yulei He: RANDS 1 (fall 2015) + RANDS 2 (spring 2016): build models on RANDS 1 and compare predictions for RANDS 2
#JSM2021 Yulei He: R-squared is about 30%; random forests and gradient boosting reduce the prediction error by about 4%, shrinking towards the mean; standard errors are way too small (about 50% smaller than they should be)
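Not their code, of course -- just the shape of the exercise, with simulated stand-ins for the two RANDS rounds and a generic outcome y:

```r
library(randomForest)

set.seed(1)
sim_round <- function(n) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  data.frame(x1, x2, y = 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n))
}
rands1 <- sim_round(500)   # stand-in for RANDS round 1 (training)
rands2 <- sim_round(500)   # stand-in for RANDS round 2 (true out-of-sample)

fit_lm <- lm(y ~ ., data = rands1)
fit_rf <- randomForest(y ~ ., data = rands1)

# Out-of-sample R-squared, scored on the later round
oos_r2 <- function(fit, newdata) {
  pred <- predict(fit, newdata = newdata)
  1 - sum((newdata$y - pred)^2) / sum((newdata$y - mean(newdata$y))^2)
}
oos_r2(fit_lm, rands2)
oos_r2(fit_rf, rands2)
```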
1. when will the survey statisticians in the U.S. move from weird variance estimation methods (grouped jackknife) to a simple and straightforward one (the bootstrap)?
and
2. when will they move from weird imputation methods with limited dimensionality and limited ability to assess the implicit model fit (hot deck) to those where you explicitly model and understand which variables matter for this particular outcome (ICE, imputation by chained equations)?
Oh and somebody reminded me of
3. when will we move from PROC STEPWISE to the lasso, as the rest of the statistics world has?
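For the last one, a toy glmnet sketch of what the replacement looks like, on made-up data (in a survey setting you would still need to carry the weights and replicate structure through, e.g. via the weights argument):

```r
library(glmnet)

set.seed(42)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

# Cross-validated lasso (alpha = 1); lambda.1se gives the parsimonious model
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.1se")
```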