Happy Friday!! Today I'd like to describe two important approaches to data privacy research and applications: synthetic data and differential privacy. I hope to generate more interest in this area among researchers and practitioners!
1/n Data privacy and data confidentiality are important topics for statisticians, computer scientists, and really, anyone who offers their own data and consumes data!
2/n Statistical agencies, in particular, are under legal obligations to protect the privacy and confidentiality of survey and census respondents, e.g. U.S. Title 26.
3/n Therefore, statistical agencies were probably among the first to work on methods of Statistical Disclosure Control (SDC), also known as Statistical Disclosure Avoidance or Statistical Disclosure Limitation.
4/n SDC methods generally fall into two categories: those for protecting tabular data and those for protecting microdata (i.e. respondent-level data).
5/n For tabular data, common SDC methods include cell suppression (i.e. not reporting a cell whose count falls below a certain threshold) and cell perturbation (i.e. adding noise to cell counts).
6/n For microdata, common SDC methods include adding random noise from a distribution centered at 0, top-coding (reporting value X for all records greater than X; there are also bottom-coding and recoding), and swapping (exchanging values of variables between two or more records).
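To make two of these microdata methods concrete, here is a minimal sketch in Python with made-up toy data (not any agency's actual procedure), showing additive noise and top-coding:

```python
import numpy as np

rng = np.random.default_rng(2021)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1000)  # toy "confidential" variable

# additive noise: perturb each value with noise centered at 0
noisy_income = income + rng.normal(loc=0.0, scale=0.1 * income.std(), size=income.size)

# top-coding: report the 99th-percentile value for every record above it
threshold = np.quantile(income, 0.99)
topcoded_income = np.minimum(income, threshold)
```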
7/n These SDC methods are great; however, they might reduce the usefulness of the resulting data (tabular or microdata) so much that users cannot draw useful inferences from them. Also, sometimes the protection level offered by these SDC methods might not be sufficient.
8/n There are two important aspects when evaluating data released after privacy protection procedures: #1. Utility, i.e. whether users can get similar inferential results using the released data; and #2. Risks, i.e. how much the procedures reduce disclosure risks.
9/n Statisticians, more likely than not, focus first on the utility aspect. For microdata: if we can build statistical models on the confidential data and create predicted values of the sensitive variables for each respondent, we can release those predicted values instead.
10/n Indeed, this utility-focused approach is what synthetic data does! Data disseminators build and estimate suitable models on the confidential data, and depending on their privacy protection goals, they can choose to synthesize some or all variables of some or all records.
11/n If the models are suitable and well estimated on the confidential data, the utility of the simulated synthetic data will be high. For example, a regression analysis done on the synthetic data might produce very similar results to the same analysis done on the confidential data.
12/n There are many utility metrics out there. For example, global utility metrics measure the distance between the confidential data distribution and the synthetic data distribution; the closer the two, the higher the utility.
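One popular global utility measure is the propensity-score mean-squared error (pMSE) studied in Woo et al. (2009) and Snoke et al. (2018). A minimal sketch, assuming both datasets share the same numeric columns; smaller pMSE means the two datasets are harder to tell apart, i.e. higher utility:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pmse(confidential: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    combined = pd.concat([confidential, synthetic], ignore_index=True)
    label = np.r_[np.zeros(len(confidential)), np.ones(len(synthetic))]  # 1 = synthetic row
    model = LogisticRegression(max_iter=1000).fit(combined, label)
    propensity = model.predict_proba(combined)[:, 1]
    c = len(synthetic) / len(combined)  # expected propensity if the two were indistinguishable
    return float(np.mean((propensity - c) ** 2))
```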
13/n There are also analysis-specific utility metrics, which focus on the specific analyses users might perform and check how well the synthetic data supports them. For example, we can compare the closeness / overlap of the confidence intervals of a regression coefficient, confidential vs synthetic.
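For regression coefficients, here is a small sketch of the interval overlap idea in the spirit of Karr et al. (2006): compute the overlap of the two confidence intervals and average the fractions of each interval it covers; values near 1 indicate high utility.

```python
def ci_overlap(conf_ci, syn_ci):
    """conf_ci and syn_ci are (lower, upper) intervals for the same coefficient."""
    lo, hi = max(conf_ci[0], syn_ci[0]), min(conf_ci[1], syn_ci[1])
    overlap = max(0.0, hi - lo)
    return 0.5 * (overlap / (conf_ci[1] - conf_ci[0]) + overlap / (syn_ci[1] - syn_ci[0]))

# e.g. ci_overlap((1.2, 2.0), (1.4, 2.3)) -> roughly 0.71
```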
14/n Some references for utility evaluation methods: Woo et al. (2009) in @JPrivConf, Snoke et al. (2018) in JRSS-A, and Karr et al. (2006) in The American Statistician. Note that analysis-specific utility is really application-specific, so be creative!
15/n Okay, so once we think the utility of the synthetic data looks good and is satisfactory, we need to figure out how to evaluate its privacy protection level. Often, checking the privacy protection level means checking the reduction in disclosure risks offered by the synthetic data.
16/n Two widely used disclosure definitions are: #1. Identification disclosure (i.e. intruders identify a record of interest, say by matching with external databases), and #2. Attribute disclosure (i.e. intruders correctly infer confidential values given the synthetic data).
17/n Therefore, if we can evaluate the risks associated with each type of disclosure, we can then decide whether the privacy protection level offered by the synthetic data is sufficient. Hu (2019) in Transactions on Data Privacy reviews approaches to risk evaluation.
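As one toy illustration (nowhere near the full machinery reviewed in Hu (2019)), here is a sketch of a matching-based identification risk check: the data holder, who knows the true links, asks how often an intruder matching on assumed quasi-identifiers would single out the right record. The column names ("age", "sex", "true_id") are hypothetical.

```python
import pandas as pd

def expected_match_risk(released: pd.DataFrame, intruder: pd.DataFrame, keys: list) -> float:
    """For each intruder record, find released records agreeing on the keys;
    a correct match hiding in a group of size k contributes 1/k to the risk."""
    risk = 0.0
    for _, target in intruder.iterrows():
        mask = (released[keys] == target[keys]).all(axis=1)
        k = int(mask.sum())
        if k > 0 and target["true_id"] in set(released.loc[mask, "true_id"]):
            risk += 1.0 / k
    return risk / len(intruder)

# e.g. expected_match_risk(synthetic_with_ids, external_file, keys=["age", "sex"])
```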
18/n So, as you can see, when we plan to disseminate synthetic data in place of confidential data, we need to make sure that we are satisfied not only with its utility, but also with its privacy protection level, usually in terms of disclosure risk reduction.
19/n There is, in fact, a trade-off between utility and risks, which intuitively makes sense: if a synthetic dataset has high utility, it might resemble the confidential data too much, resulting in high disclosure risks, i.e. insufficient privacy protection.
20/n Therefore, data disseminators need to strike a balance between utility and risks; you simply cannot maximize both at the same time. This is actually true for differential privacy too, as we will discuss soon.
21/n So how can we make synthetic data? Many synthetic data approaches use Bayesian modeling, since synthetic data can be generated from the model's posterior predictive distribution. If the Bayesian models are well specified and well estimated, high utility follows.
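A minimal sketch of this idea for a single continuous outcome, using a Bayesian linear regression with a flat prior: draw the parameters from their posterior, then draw synthetic outcomes from the posterior predictive distribution. This illustrates the mechanics only, not any agency's production synthesizer.

```python
import numpy as np

def synthesize_outcome(X: np.ndarray, y: np.ndarray, seed: int = 0) -> np.ndarray:
    """Replace the confidential outcome y with one posterior predictive draw."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    sigma2 = (n - p) * s2 / rng.chisquare(n - p)                 # posterior draw of sigma^2
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)   # posterior draw of beta
    return rng.normal(X @ beta, np.sqrt(sigma2))                 # posterior predictive draw
```

Repeating this m times gives m synthetic datasets, whose analyses are then combined with the appropriate combining rules.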
22/n In fact, these Bayesian approaches stemmed from the multiple imputation (MI) literature for missing data in the early 1990s. Check out Rubin (1993) and Little (1993) in the Journal of Official Statistics. Reiter and Raghunathan (2007) in JASA discuss multiple adaptations of MI.
23/n There are also non-Bayesian methods for creating synthetic data. A super handy resource is the synthpop R package: cran.r-project.org/web/packages/s…
24/n Moreover, data disseminators can create partially synthetic data, where only a subset of variables is sensitive and therefore synthesized, or fully synthetic data, where all variables are sensitive and therefore synthesized. It really depends on the protection goals!
25/n In fact, several statistical agencies around the world have published synthetic data products for public use. In the US, the Census Bureau has OnTheMap (onthemap.ces.census.gov) and synLBD (census.gov/programs-surve…).
26/n There are also the IAB Establishment Panel in Germany (iab.de/en/erhebungen/…) and synthetic business microdata disseminated by the Canadian Research Data Centre Network (crdcn.org/data).
27/n So if synthetic data is so wonderful, why isn't everyone making it? Its drawback is that the disclosure risk evaluation of synthetic data often relies on assumptions about the intruder's knowledge and behavior. That is, a different assumption might yield a different risk profile.
28/n Now let's move to differential privacy (DP), which provides a mathematical framework with a provable privacy protection guarantee. That is, synthetic data starts with maximizing utility, whereas DP starts with a pre-defined privacy guarantee. Both are great!
29/n The DP framework focuses on neighboring databases (most often two databases differing by one record) and privatizes outputs from the database to be protected, in such a way that information about that differing record is not leaked.
30/n One typical DP approach is to add noise to privatize the output. To determine how much noise to add, we need to consider two pieces: #1. How sensitive the output is to a change in one record, and #2. The level of privacy guarantee we want to achieve.
31/n The first piece is expressed as the sensitivity of the output, while the second piece concerns the chosen privacy parameters, e.g. epsilon for epsilon-DP, and epsilon and delta for (epsilon, delta)-DP. There are many variations of DP, as you can see!
32/n There are many DP mechanisms out there that achieve DP for various outputs, and more appear constantly, e.g. the Laplace Mechanism, the Gaussian Mechanism, and the Exponential Mechanism. Check out the book by Dwork and Roth (2014) for details of DP algorithms with examples!
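A minimal sketch of the Laplace Mechanism for a counting query: a count has global sensitivity 1 (adding or removing one record changes it by at most 1), so adding Laplace noise with scale 1/epsilon makes this single release epsilon-DP. Variable names are illustrative.

```python
import numpy as np

def laplace_count(values: np.ndarray, condition, epsilon: float, seed: int = 0) -> float:
    """Release a noisy count of records satisfying `condition` under epsilon-DP."""
    rng = np.random.default_rng(seed)
    true_count = float(np.sum(condition(values)))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# e.g. laplace_count(ages, lambda a: a >= 65, epsilon=1.0)
```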
33/n For synthetic data, we know there is a trade-off between utility and risk. The same is generally true for DP algorithms. For example, when adding noise to an output, the higher the privacy protection, the more noise needs to be added, leading to lower utility.
34/n Since DP starts from the risk / privacy perspective and checks utility afterwards, many new DP approaches and algorithms focus on improving the utility of DP outputs. If you want to play with DP algorithms, check out Ben Rubinstein's R code: github.com/brubinstein/di…
35/n Given that synthetic data focuses on maximizing utility and DP focuses on formally controlling the risk, would it make sense to marry the two, i.e. to produce differentially private synthetic data? YES! Many researchers, especially statisticians, are working on it!
36/n There are quite a number of DP synthetic data approaches out there already, so let me just point to Bowen and Snoke (2021) in @JPrivConf, who did a comparative study of the methods that competed in the NIST PSCR Data Challenge: journalprivacyconfidentiality.org/index.php/jpc/…
37/n In the Bayesian space, Dimitrakakis et al. (2017) in JMLR and Wang et al. (2015) in ICML investigated how posterior sampling links to DP. These are first steps toward creating DP synthetic data based on Bayesian methods.
38/38 I hope this thread has made you more interested in statistical data privacy! It would be great to see more researchers and practitioners in this ever-more-important field. Also, please feel free to add your own tweets on this topic!
