/1. The Census Bureau plans to add intentional errors to the 2020 census to protect the confidentiality of census respondents. The Bureau insists that this intentional error is necessary to combat the threat of “database reconstruction.”
/2. Database reconstruction is a process for inferring individual-level responses from tabular data. The Chief Scientist of the Census Bureau asserts that database reconstruction “is the death knell for traditional data publication.”
/3. To demonstrate the threat, the Census Bureau conducted a database reconstruction experiment that attempted to infer the age, sex, race, and Hispanic or non-Hispanic ethnicity of every individual in each of the 6.3 million inhabited census blocks in the 2010 census.
/4. Prior to April 2021, the Census Bureau’s database reconstruction experiment was documented solely in tweets and PowerPoint slides that provided few details, so it was difficult for outsiders to evaluate.
/5. In conjunction with recent legal proceedings, the Census Bureau’s chief scientist has now released a more detailed description of the experiment, and this opens new opportunities to appraise the results.
/6. Using 6.2 billion statistics from nine tables published as part of the 2010 census, the Census Bureau constructed a system of simultaneous equations consistent with the published tables, and solved the system using Gurobi linear programming software.
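To make the mechanics concrete, here is a minimal sketch of the same idea on a hypothetical three-person block, brute-forcing candidate populations instead of calling Gurobi; every count, category, and constraint below is invented for illustration.

```python
# Toy database reconstruction: find individual-level records consistent with
# a few published-style tabulations for one hypothetical block of 3 people.
# The real experiment used billions of constraints and the Gurobi solver;
# here we simply enumerate every candidate population and keep the ones that
# satisfy the (invented) published counts.
from itertools import combinations_with_replacement, product

AGES = range(20, 30)        # toy universe of possible ages
SEXES = ["M", "F"]
RACES = ["White", "Black"]

# Hypothetical published tabulations for the block:
published = {"total": 3, "males": 2, "white": 2, "median_age": 24}

def consistent(pop):
    """Does a candidate population reproduce the published counts?"""
    ages = sorted(age for age, _, _ in pop)
    return (sum(sex == "M" for _, sex, _ in pop) == published["males"]
            and sum(race == "White" for _, _, race in pop) == published["white"]
            and ages[len(ages) // 2] == published["median_age"])

candidates = combinations_with_replacement(product(AGES, SEXES, RACES),
                                           published["total"])
solutions = [pop for pop in candidates if consistent(pop)]
print(f"{len(solutions)} candidate populations satisfy the published counts")
print("one reconstruction:", solutions[0])
```

With only a handful of published counts, many candidate populations remain consistent with the tables; the actual experiment narrows that space by stacking billions of such constraints across the nine published tables.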
/7. The “reconstructed” data produced by the experiment consists of rows of data identifying the age, sex, and race/ethnicity for each person in a hypothetical population of each census block.
/8. The Census Bureau found that for 53.52% of their hypothetical population, there was not a single case in the real population that matched on block, age, sex, and race/ethnicity. There was at least one person who matched on all characteristics in 46.48% of cases.
/9. In a new working paper, @dvanriper and I use a Monte Carlo simulation to assess how many matches would be expected purely by chance. users.hist.umn.edu/~ruggles/Artic…
/10. If we assign age and sex randomly and then give everyone on each block the block’s modal race and ethnicity, we estimate a match rate of 40.9%, almost as high as the rate the Census Bureau obtained from their elaborate database reconstruction.
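A rough sketch of that baseline procedure, run on synthetic block populations rather than the real 2010 data; the printed rate will not reproduce the 40.9% figure, which comes from the actual census blocks.

```python
# Random-assignment baseline on synthetic blocks: random age and sex,
# modal race/ethnicity for everyone, then count how many guessed records
# match at least one real record on the same block.
import random

random.seed(0)
RACES = ["White", "Black", "Asian", "Other"]

def synthetic_block(n):
    """Invent a 'true' population of n people for one block."""
    return [(random.randint(0, 89),                               # age
             random.choice("MF"),                                 # sex
             random.choices(RACES, weights=[70, 15, 10, 5])[0])   # race/ethnicity
            for _ in range(n)]

def baseline_guess(block):
    """Random age and sex; everyone gets the block's modal race/ethnicity."""
    races = [race for _, _, race in block]
    mode = max(set(races), key=races.count)
    return [(random.randint(0, 89), random.choice("MF"), mode) for _ in block]

def match_rate(blocks):
    matched = total = 0
    for block in blocks:
        truth = set(block)                    # real records on this block
        for guess in baseline_guess(block):
            matched += guess in truth         # matches someone on age, sex, race
            total += 1
    return matched / total

blocks = [synthetic_block(random.randint(1, 100)) for _ in range(2000)]
print(f"baseline match rate on synthetic blocks: {match_rate(blocks):.1%}")
```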
/11. This newly released graph from the Census shows the relationship between match rate and block size. As our analysis shows, the higher match rate in large blocks would be expected if the matches were occurring through chance alone.
/12. Perhaps the most surprising thing about the graph is the low match rate for very small blocks. It looks like the database reconstruction exactly matches the “true” data just over 20% of the time, suggesting an error rate of almost 80%.
/13. This is surprising because database reconstruction ought to work best with the smallest blocks. Even with a “fuzzy” match allowing a one-year error in age, the match rate is just over 40%.
/14. This is exceptionally poor performance, because one set of tables used in the database reconstruction (P12A-I) provides age in five-year groups by sex for each race/ethnicity group.
/15. It is trivial to convert that table into individual-level information on age in five-year groups, sex, race, and ethnicity; my colleagues and I explain how to do it in this working paper, p. 11. assets.ipums.org/_files/mpc/wp2…
/16. So the published data gave them everything they needed for the database reconstruction except exact age. All they had to do was guess the exact age within the five-year group, and they would be done, with no need to resort to simultaneous equations or Gurobi.
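For illustration, here is a minimal sketch of that conversion plus the within-group age guess, using an invented fragment of a P12-style table; all counts, labels, and the layout are hypothetical.

```python
# Expand a P12-style tabulation into individual-level records.
# The dict is an invented fragment of one block's sex-by-age-group counts
# for a single race/ethnicity group; the real tables (P12A-I) follow this
# general pattern.
import random

p12_fragment = {                  # (sex, age_group) -> count, hypothetical numbers
    ("M", (20, 24)): 3,
    ("F", (20, 24)): 2,
    ("F", (25, 29)): 1,
}
race_ethnicity = "White alone, not Hispanic"   # group covered by this fragment

records = []
for (sex, (lo, hi)), count in p12_fragment.items():
    for _ in range(count):
        exact_age = random.randint(lo, hi)     # the only guess left: exact age in the bin
        records.append((exact_age, sex, race_ethnicity))

for record in records:
    print(record)
```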
/17. Let’s see. If we were just to randomly guess exact age, given that we know the five-year age group, how often would that come out right? Oh, right: 1-in-5, or 20%, just about the result they got.
/18. Suppose you had a one-year margin of error. How often could you guess correctly within a five-year age group? Guess the middle year of the group and three of the five possible ages fall within a year, so it should be 3-in-5, or 60%. How could they do so badly and get only a 40% fuzzy match rate? That fuzzy match rate is way worse than you would get by chance.
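A quick simulation of that arithmetic, assuming ages are uniform within the five-year group; the exact-match guess is uniform within the bin, and the 3-in-5 fuzzy figure assumes guessing the middle year.

```python
# Check the 20% exact-match and 60% fuzzy-match (+/- 1 year) figures by
# simulation, with true ages drawn uniformly from one five-year group.
import random

random.seed(1)
TRIALS = 100_000
group = range(20, 25)                              # any five-year bin, e.g. 20-24

exact = fuzzy = 0
for _ in range(TRIALS):
    true_age = random.choice(group)
    exact += random.choice(group) == true_age      # uniform random guess in the bin
    fuzzy += abs(22 - true_age) <= 1               # middle-of-bin guess, +/- 1 year

print(f"exact-match rate: {exact / TRIALS:.1%}")   # ~20%
print(f"fuzzy-match rate: {fuzzy / TRIALS:.1%}")   # ~60%
```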
/19. It could be that they performed so badly because of swapping. They could easily measure how much error is due to swapping, but they are unlikely to do so because it would demonstrate the effectiveness of traditional statistical disclosure control, and they are trying to prove the opposite.
/20. The database reconstruction experiment was a failure because the results are largely what would be expected by chance. It cannot possibly justify introduction of deliberate error into the Census. /fin
