/1. Yesterday at the ACS Data Users Conference, the Census Bureau described its plans to replace the American Community Survey (ACS) microdata with “fully synthetic” data over the next three years.
/2. Details of the methodology have not been disclosed, but the idea is to develop models describing the interrelationships of all the variables in the ACS, and then construct a simulated population consistent with those models.
/3. Such modeled data capture relationships between variables only if those relationships have been intentionally included in the model. Accordingly, synthetic data are poorly suited to studying unanticipated relationships, which impedes new discovery.
/4. The large size of the ACS means that it is possible to study small population subgroups, but the synthetic data cannot capture all the ways in which interrelationships among variables can vary across subgroups.
/5. For example, the synthetic data would certainly incorporate a general relationship between income and education but could not assess that relationship separately for every possible subgroup.
/6. The relationship of income and education might be different for American Indians in South Dakota or Asian Indians in Queens.
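A toy simulation can make the subgroup point concrete. This is a minimal sketch under invented assumptions, not the Census Bureau's method (which has not been disclosed): the group labels, slopes, sample sizes, and the pooled linear "synthesizer" are all hypothetical. The synthesizer models income on education for the whole population, so the distinct relationship in a small subgroup is present in the real data but absent from the synthetic data.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    resid_var = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return a, b, resid_var ** 0.5


def make_real(n_majority=20000, n_subgroup=500):
    """Hypothetical 'real' microdata: the income/education slope differs by group."""
    rows = []
    for group, n, slope in (("majority", n_majority, 2.0),
                            ("subgroup", n_subgroup, 0.5)):
        for _ in range(n):
            educ = rng.randint(8, 20)                       # years of schooling
            income = 10.0 + slope * educ + rng.gauss(0, 3)  # arbitrary units
            rows.append((group, educ, income))
    return rows


def synthesize(real):
    """Assumed synthesizer: one pooled income ~ education model, no group term."""
    a, b, sd = fit_line([r[1] for r in real], [r[2] for r in real])
    # Keep each record's group and education, but redraw income from the model.
    return [(g, e, a + b * e + rng.gauss(0, sd)) for g, e, _ in real]


def subgroup_slope(rows, group):
    xs = [e for g, e, _ in rows if g == group]
    ys = [y for g, _, y in rows if g == group]
    return fit_line(xs, ys)[1]


real = make_real()
synthetic = synthesize(real)
real_sub = subgroup_slope(real, "subgroup")
synth_sub = subgroup_slope(synthetic, "subgroup")
print(f"subgroup slope in real data:      {real_sub:.2f}")
print(f"subgroup slope in synthetic data: {synth_sub:.2f}")
```

In this sketch the synthetic subgroup slope tracks the pooled population slope rather than the subgroup's own. A synthesizer could of course include a group interaction, but only for subgroups someone thought to model in advance.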
/7. The power of ACS microdata in large measure derives from their hierarchical structure: individuals are nested in households, and the interrelationships of household members are known.
/8. This allows analysis of millions of potential associations across household members. For example, investigators can measure ethnic intermarriage, or the impact of a partner’s education on women’s fertility.
/9. The synthetic data apparently incorporates only individual-level interrelationships among variables, so analysis across household members will be impossible.
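The household point can be sketched the same way. This is an illustration under stated assumptions only: the households, the education values, and the "individual-level" synthesizer (which redraws each person independently from the pooled distribution) are hypothetical stand-ins, since the actual methodology has not been disclosed. The marginal distribution of education survives, but the association between partners does not.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical real microdata: two partners per household, with correlated
# years of education (partners tend to resemble each other).
heads, partners = [], []
for _ in range(5000):
    head = rng.gauss(13, 3)
    partner = 0.6 * head + 0.4 * 13 + rng.gauss(0, 2)
    heads.append(head)
    partners.append(partner)

real_corr = pearson(heads, partners)

# Assumed individual-level synthesizer: every person is redrawn independently
# from the pooled education values, ignoring who lives with whom.
pool = heads + partners
synth_heads = [rng.choice(pool) for _ in heads]
synth_partners = [rng.choice(pool) for _ in partners]

synth_corr = pearson(synth_heads, synth_partners)
print(f"partner education correlation, real:      {real_corr:.2f}")
print(f"partner education correlation, synthetic: {synth_corr:.2f}")
```

Any measure that depends on linking household members, such as intermarriage rates or the effect of a partner's education, would be affected in the same way under this assumption.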
/10. These limitations are important because the ACS microdata is the most intensively used source available for demographic and economic research.
/11. Hundreds of thousands of academic researchers, planners, and policy makers rely on the ACS, and according to Google Scholar they generate about 12,000 publications per year.
/12. Common topics of analysis include poverty, inequality, immigration, internal migration, ethnicity, residential segregation, disability, transportation, fertility, nuptiality, occupational structure, education, and family change.
/13. If public use data become unusable or inaccessible because of overzealous disclosure control, there will be far-reaching consequences. The quantity and quality of research about U.S. policies, the economy, and social structure will decline precipitously.
/14. The Census Bureau appears to recognize that synthetic data will be inappropriate for most research purposes.
/15. The Census Bureau proposes a system whereby investigators would develop analyses using synthetic data, and then submit them to the Census Bureau for “validation” using real data.
/16. One problem is that investigators need access to the real data for exploratory analysis to discover the relevant variables to incorporate in their analyses.
/17. Another problem is logistical: The Census Bureau is not equipped or funded to carry out the tens or hundreds of thousands of validation analyses that would be needed to replace current usage.
/18. And the results of the validation runs would then have to go through disclosure review, and the Census Bureau also lacks the capacity to do that work at scale.
/19. The reason the Census Bureau wants to get rid of one of the world’s most intensively used scientific resources is concern about respondent confidentiality.
/20. The Census Bureau implicitly acknowledges that there is not a single documented case of reidentification of a respondent in the ACS or decennial census microdata.
/21. Over 100 countries around the world disseminate similar microdata through @ipums, and again there is not a single documented case in which a respondent’s identity has been revealed.
/22. In the presentation yesterday, the Bureau maintained that
/23. Not only are the risks of disclosure unmeasurably infinitesimal, if by some miracle someone’s ACS data were exposed the resulting harms would be minimal.
/24. The ACS has no information that could be used to aid identity theft, and most of the information it does include could far more easily be obtained from other sources.
/25. If we weigh the profound cost of eliminating the ACS microdata against fanciful benefits for respondent confidentiality, the Census Bureau has no case.
/26. Such a massive shift in the Nation’s statistical infrastructure would be “arbitrary and capricious, an abuse of discretion” and therefore in violation of the Administrative Procedure Act.
/27. Although some Census Bureau staff members treat the synthetic ACS as if it were a done deal, there is still time to avert this disastrous course.
/28. Acting Census Director @jarmin_ron or Census Director nominee @_Rob_Santos may decide to back away from the precipice.
/29. It may be necessary, however, for the research community to pursue political or legal strategies to retain open access to the crown jewels of demographic data infrastructure.
/30. To stay informed as things develop, watch this space. We will also post updates as we learn more at ipums.org/changes-to-cen…. /fin.

Keep Current with Steven Ruggles

