Happy Friday!! Today I'd like to describe two important approaches to data privacy research and applications: synthetic data and differential privacy. I hope to generate more interest in this area among researchers and practitioners!
1/n Data privacy and data confidentiality are important topics for statisticians, computer scientists, and really, anyone who offers their own data and consumes data!
2/n Statistical agencies, in particular, are under legal obligations to protect the privacy and confidentiality of survey and census respondents, e.g. U.S. Title 26.
3/n Therefore, statistical agencies are probably among the first who started working on methods of Statistical Disclosure Control (SDC), also known as Statistical Disclosure Avoidance and Statistical Disclosure Limitation.
4/n SDC methods are generally categorized in two: the ones for protecting tabular data and the ones for protecting microdata (i.e. respondent-level data).
5/n For tabular data, common SDC methods include cell suppression (i.e. not reporting if a cell count is smaller than certain values) and cell perturbation (i.e. adding noise to cell counts).
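To make the two tabular methods concrete, here is a minimal sketch. The table, the suppression threshold of 5, and the noise range of plus or minus 2 are all hypothetical values chosen for illustration, not any agency's actual rule.

```python
import random

def suppress_small_cells(counts, threshold=5):
    """Cell suppression: report None for any count below `threshold` (assumed value)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

def perturb_cells(counts, max_noise=2, seed=0):
    """Cell perturbation: add small integer noise to each count, clipping at zero."""
    rng = random.Random(seed)
    return {
        cell: max(0, n + rng.randint(-max_noise, max_noise))
        for cell, n in counts.items()
    }

# Hypothetical table of counts by (region, age group).
table = {
    ("North", "18-34"): 120,
    ("North", "65+"): 3,
    ("South", "18-34"): 87,
    ("South", "65+"): 11,
}

suppressed = suppress_small_cells(table)
perturbed = perturb_cells(table)
print(suppressed)
print(perturbed)
```

In this sketch the small ("North", "65+") cell is withheld under suppression, while perturbation reports every cell but none of them exactly.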
6/n For microdata, common SDC methods include adding random noise from a distribution centered at 0, topcoding (reporting value X for all records that are greater than X; there are also bottom coding and recoding), and swapping (swapping variables between two or more records).
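The three microdata methods above can be sketched as follows. The income values, the noise standard deviation, and the topcoding cap are hypothetical, and the Gaussian is just one possible choice of a noise distribution centered at 0.

```python
import random

def add_noise(values, sd=1.0, seed=0):
    """Add random noise from a Gaussian centered at 0 (one possible noise choice)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sd) for v in values]

def topcode(values, cap):
    """Topcoding: report `cap` for every value greater than `cap`."""
    return [min(v, cap) for v in values]

def swap_values(values, i, j):
    """Swapping: exchange one variable's values between records i and j."""
    swapped = list(values)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

# Hypothetical respondent-level incomes.
incomes = [32_000, 45_000, 51_000, 250_000, 1_200_000]

noisy = add_noise(incomes, sd=1_000.0)
topcoded = topcode(incomes, cap=200_000)
swapped = swap_values(incomes, 0, 4)
print(topcoded)
print(swapped)
```

Note that swapping keeps the marginal distribution of the variable intact, while topcoding changes the upper tail, which is one reason utility has to be checked after protection.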
7/n These SDC methods are great. However, they might reduce the usefulness of the resulting tabular data or microdata, so users might not get useful inferences from them. Also, sometimes the protection level offered by these SDC methods might not be sufficient.
8/n There are two important evaluation aspects of releasing data after privacy protection procedures: #1. Utility, i.e. whether users can get similar inferential results using the released data; and #2. Risks, i.e. how much reduction of disclosure risks the procedures induce.
9/n Statisticians, more likely than not, first focus on the utility aspect. For microdata: If we are able to build statistical models on the confidential data and create predicted values for sensitive variables of each respondent, we can release those predicted values instead.
10/n Indeed, this utility-focused approach is what synthetic data does! Data disseminators build and estimate suitable models on the confidential data, and depending on their privacy protection goals, they can choose to synthesize some or all variables of some or all records.
11/n If the models are suitable and well estimated on the confidential data, the utility of the simulated synthetic data would be high. For example, a regression analysis done on the synthetic data might produce results very similar to those from the confidential data.
12/n There are many utility metrics out there. For example, global utility metrics measure the distance between the confidential data distribution and the synthetic data distribution. And of course, the smaller the distance, the higher the utility.
13/n There are also analysis-specific utility metrics, which focus on the specific analyses users might perform, and check how useful the synthetic data is. For example, we can compare the closeness / overlap of the C.I. of a regression coefficient, confidential vs synthetic.
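As a rough sketch of the C.I. overlap idea: one common way to score it is to average the fraction of each interval covered by their intersection. That averaging rule is my assumption here and may not match the exact definition in any one cited paper; the interval endpoints are made up.

```python
def interval_overlap(conf_ci, syn_ci):
    """Average share of each interval covered by their intersection (0 to 1)."""
    (l_c, u_c), (l_s, u_s) = conf_ci, syn_ci
    l_over, u_over = max(l_c, l_s), min(u_c, u_s)
    if u_over <= l_over:
        return 0.0  # the two intervals do not overlap at all
    overlap = u_over - l_over
    return 0.5 * (overlap / (u_c - l_c) + overlap / (u_s - l_s))

# Hypothetical 95% intervals for the same regression coefficient.
confidential_ci = (1.0, 3.0)
synthetic_ci = (2.0, 4.0)
score = interval_overlap(confidential_ci, synthetic_ci)
print(score)
```

A score near 1 means the synthetic-data interval nearly coincides with the confidential-data interval, and a score of 0 means they are disjoint.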
14/n To add some references here for utility evaluation methods! Woo et al. (2009) in @JPrivConf, Snoke et al. (2018) in JRSS-A, and Karr et al. (2006) in The American Statistician. Note the analysis-specific utility is really very specific to the applications so be creative!
15/n Okay, so once we think the utility looks good and is satisfactory, we need to figure out how to evaluate the privacy protection level of the synthetic data. Often, checking the privacy protection level means checking the reduction of disclosure risks offered by the synthetic data.
16/n Two widely used disclosure definitions are: #1. Identification disclosure (i.e. intruders identify a record of interest, say by matching with external databases), and #2. Attribute disclosure (i.e. intruders correctly infer the confidential values given the synthetic data).
17/n Therefore, if we are able to evaluate the risks associated with each type of disclosure, we can then decide whether the privacy protection level offered by the synthetic data is sufficient or not. Hu (2019) in Transactions on Data Privacy reviews ways for risk evaluation.
18/n So, as you can see, when we plan to disseminate synthetic data in place of confidential data, we need to make sure that we are satisfied not only with its utility, but also with its privacy protection level, usually in terms of disclosure risk reduction.
19/n There is, in fact, a trade-off between utility and risks, which intuitively makes sense: if a synthetic dataset has high utility, it might resemble the confidential data too much and therefore result in high disclosure risks, i.e. insufficient privacy protection.
20/n Therefore, data disseminators need to strike a balance between utility and risks; you simply cannot maximize both at the same time. This is actually true for differential privacy too, as we will discuss soon.
21/n So how can we make synthetic data? Many synthetic data approaches use Bayesian modeling, since synthetic data can be generated from the posterior predictive distribution of the fitted model. If the Bayesian models are well specified and estimated, high utility follows.
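Here is a toy sketch of posterior predictive synthesis for a single binary variable, assuming a Beta-Bernoulli model with a uniform Beta(1, 1) prior. The model, the prior, and the data are all hypothetical; real synthesizers use much richer models.

```python
import random

def synthesize_binary(confidential, n_synthetic, a=1.0, b=1.0, seed=0):
    """Draw synthetic 0/1 values from a Beta-Bernoulli posterior predictive."""
    rng = random.Random(seed)
    successes = sum(confidential)
    failures = len(confidential) - successes
    # Step 1: draw the parameter from its posterior, Beta(a + successes, b + failures).
    theta = rng.betavariate(a + successes, b + failures)
    # Step 2: draw synthetic values from the model given that parameter draw.
    return [1 if rng.random() < theta else 0 for _ in range(n_synthetic)]

# Hypothetical confidential sensitive variable for 10 respondents.
confidential = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
synthetic = synthesize_binary(confidential, n_synthetic=10)
print(synthetic)
```

Repeating the two steps with fresh parameter draws gives multiple synthetic datasets, which lets analysts account for the extra uncertainty from synthesis.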
22/n In fact, these Bayesian approaches stemmed from the multiple imputation (MI) literature for missing data in the early 1990s. Check out Rubin (1993) and Little (1993) in the Journal of Official Statistics. Reiter and Raghunathan (2007) in JASA discussed multiple adaptations of MI.
24/n Moreover, data disseminators can create partially synthetic data, where only a subset of variables is sensitive and therefore synthesized, or fully synthetic data, where all variables are sensitive and therefore synthesized. It really depends on the protection goals!
26/n There are also the IAB Establishment Panel in Germany (iab.de/en/erhebungen/…) and synthetic business microdata disseminated by the Canadian Research Data Centre Network (crdcn.org/data).
27/n So if synthetic data is so wonderful, why isn't everyone making it? Its drawback is that the disclosure risk evaluation of synthetic data often relies on assumptions about intruders' knowledge and behavior. That is, a different assumption might present a different risk profile.
28/n Now let's move to differential privacy (DP), which provides a mathematical framework with a proven privacy protection guarantee. That is, synthetic data starts with maximizing utility, whereas DP starts with a pre-defined privacy guarantee. Both are great!
29/n The DP framework focuses on neighboring databases (most often two databases differing by one record), and privatizes outputs from the database to be protected in such a way that information about that differing record won't be leaked.
30/n One typical DP approach is to add noise to privatize the output. To determine the amount of noise to add, we need to consider two pieces: #1. How sensitive that output is to a change in the one differing record, and #2. The level of privacy guarantee we want to achieve.
31/n The first piece can be expressed as the sensitivity of the output, while the second piece is about the selected privacy parameters, e.g. epsilon for epsilon-DP, and epsilon and delta for (epsilon, delta)-DP. There are many variations of DP, as you can see!
32/n There are many DP mechanisms out there that can achieve DP for various outputs, and more are appearing constantly, e.g. the Laplace Mechanism, Gaussian Mechanism, and Exponential Mechanism. Check out the book by Dwork and Roth (2014) for details of DP algorithms with examples!
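As an illustration of how sensitivity and epsilon combine, here is a minimal sketch of the Laplace Mechanism for a count query. The count, the epsilon value, and the function name are hypothetical, and a sensitivity of 1 assumes neighboring databases differ by adding or removing one record. This is for intuition only; it uses Python's standard random generator and is not a production-grade DP implementation.

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace(0, sensitivity / epsilon) noise."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise

rng = random.Random(2024)
true_count = 412  # hypothetical result of a count query
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(private_count)
```

Only the noisy value would be released; the true count stays with the data holder.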
33/n For synthetic data, we know there is a trade-off between utility and risk. This is generally true for DP algorithms too. For example, when adding noise to an output, the higher the privacy protection, the larger the amount of noise that needs to be added, leading to lower utility.
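To put numbers on this trade-off, here is a small sketch assuming Laplace noise with scale equal to sensitivity divided by epsilon, and a sensitivity of 1. The epsilon values are arbitrary illustrations.

```python
import math

def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise needed for epsilon-DP at this sensitivity."""
    return sensitivity / epsilon

sensitivity = 1.0
epsilons = (0.1, 0.5, 1.0, 5.0)  # smaller epsilon means stronger privacy
scales = [laplace_scale(sensitivity, eps) for eps in epsilons]

for eps, scale in zip(epsilons, scales):
    sd = math.sqrt(2) * scale  # standard deviation of Laplace noise with this scale
    print(f"epsilon={eps}: Laplace scale={scale:.2f}, noise sd={sd:.2f}")
```

Moving from epsilon = 5 to epsilon = 0.1 multiplies the noise scale by 50, which is the utility cost of the stronger guarantee.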
34/n Since DP comes from the risk / privacy perspective and checks utility afterwards, many new DP approaches and algorithms are focused on improving the utility of the DP outputs. If you want to play with DP algorithms, check out Ben Rubinstein's R code: github.com/brubinstein/di…
35/n Given that synthetic data focuses on maximizing utility and DP focuses on formally controlling the risk, would it make sense to marry these two? That is, to produce differentially private synthetic data? YES! Many researchers, especially statisticians, are working on it!
36/n There are quite a number of DP synthetic data approaches out there already, so let me just post Bowen and Snoke (2021) in @JPrivConf, who did a comparative study of methods that competed in the NIST PSCR Data Challenge: journalprivacyconfidentiality.org/index.php/jpc/…
37/n In the Bayesian space, Dimitrakakis et al. (2017) in JMLR and Wang et al. (2015) in ICML investigated how posterior sampling links to DP. These are first steps toward creating DP synthetic data based on Bayesian methods.
38/38 I hope this thread has made you more interested in statistical data privacy! It would be great to see more researchers and practitioners in this ever more important field. Also, please feel free to add your tweets about this topic!
Happy Thursday! Today, I'd like to introduce and discuss various approaches, innovations, and resources for introducing Bayesian statistics to undergraduates! I am sure I will miss something good, so feel free to add yours or the ones you know.
First, a little bit of history. Bayesian methods became widely used thanks to the computational advances in the early 1990s, including the Gibbs sampler and Metropolis-Hastings algorithms (e.g. Gelfand and Smith (1990)).
However, even before that revolutionary advance, innovative educators had designed ways to introduce Bayes to students, e.g. emphasizing the intuition behind specifying a prior for a data analysis problem while relying on numerical integration, Franck et al. (1988).
Let’s talk vectorization! You may have heard about or experienced how simple NumPy array ops (such as a dot product) run significantly faster than for loops or list comprehensions in Python. How? Why? Thread incoming.
Suppose we are doing a dot product on two n-dim vectors. In a Python for loop, scalars are individually loaded into registers, and operations are performed on the scalar level. Ignoring the sum, this gives us n multiplication operations.
NumPy makes this faster by employing vectorization, where you can load multiple scalars into registers and get many products for the price of one operation (SIMD). SIMD — single instruction, multiple data — is a backbone of NumPy vectorization.
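A quick sketch comparing the two approaches on the same vectors. The vector length and random inputs are arbitrary, and whether SIMD instructions are actually used depends on your NumPy build and hardware, so treat any speedup you measure as specific to your machine.

```python
import numpy as np

def dot_loop(x, y):
    """Dot product with a plain Python loop: one scalar multiply per iteration."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = rng.standard_normal(10_000)

loop_result = dot_loop(x, y)
numpy_result = float(np.dot(x, y))  # same computation, run in compiled code

print(loop_result, numpy_result)
```

The two results agree up to floating-point rounding. To see the speed difference yourself, time each call with the standard `timeit` module.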
Today I will be talking about some of the data structures we use regularly when doing data science work. I will start with numpy's ndarray.
What is an ndarray? It's numpy's abstraction for describing an array, or a group of numbers. In math terms, arrays are a "catch all" term used to describe matrices or vectors. Behind the scenes, it essentially describes memory using several key attributes:
* pointer: the memory address of the first byte in the array
* type: the kind of elements in the array, such as floats or ints
* shape: the size of each dimension of the array (ex: 5 x 5 x 5)
* strides: number of bytes to skip to proceed to the next element
* flags
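These attributes can be inspected directly. A small sketch, using a 5 x 5 x 5 array of 8-byte floats as an arbitrary example:

```python
import numpy as np

arr = np.zeros((5, 5, 5), dtype=np.float64)

pointer = arr.ctypes.data          # memory address of the first byte (an int)
print(arr.dtype)                   # element type: float64, 8 bytes each
print(arr.shape)                   # (5, 5, 5)
print(arr.strides)                 # (200, 40, 8): bytes to step along each axis
print(arr.flags["C_CONTIGUOUS"])   # True: rows are laid out back to back

# A transpose is a view: same memory, but shape and strides are reversed.
view = arr.T
print(view.strides)                # (8, 40, 200)
print(np.shares_memory(arr, view)) # True
```

The last stride equals the item size (8 bytes), and each earlier stride is the next stride times the next dimension's length (5 * 8 = 40, then 5 * 40 = 200).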
During my leave I’ve really enjoyed reading about the inspiring women trailblazers in statistics who paved the way for us. Here are some of my favourite quotes in chronological order. Please share yours! #WSDS
Florence Nightingale states in her essay Cassandra 👇
🖼 source: Wikimedia commons
I’m really looking forward to attending this 👇 #Nightingale2020 has been one of the few things worth celebrating this year! Her lessons on sanitation couldn’t be more relevant. #WSDS
As part of the bicentenary celebrations of the birth of the first woman elected fellow of the @RoyalStatSoc, at the society we’ve also organised several events throughout the year rss.org.uk/news-publicati…
Support mechanisms for students and early career researchers have become ever so important during the pandemic, yet more difficult to provide.
🖼️Another beautiful and on-point creation by @allison_horst
As a consequence, the power and potential of the support they receive from online communities like this one have been strengthened by the circumstances. I have personally valued them more than ever.
When I registered to curate this account earlier in the year I didn’t know there was going to be either a pandemic or elections. I just thought it would be a nice way to return to work after extended maternity leave, and a great way to get my confidence & stats interests back.