NEW PREPRINT 🎉 Synthetic datasets: A primer psyarxiv.com/dmfb3/

By sharing synthetic datasets that mimic original datasets that could not otherwise be made open, researchers can ensure the reproducibility of their results while maintaining participant privacy

[THREAD]
Openly accessible biomedical research data provides ENORMOUS utility for science and society. With open data, scholars can verify results, generate new knowledge, form new hypotheses, and reduce the unnecessary duplication of data collection.
Researchers who wish to share data while reducing the risk of disclosure have traditionally used data anonymization procedures to mask identities, in which explicit identifiers such as names, addresses, and national identity numbers are removed.
But despite these anonymization efforts, specific individuals can still be identified in anonymized datasets with high accuracy (given certain information), which we saw with this recent @NatureComms paper nature.com/articles/s4146…
To reduce these risks, researchers have also used data aggregation and the addition of random noise. But these methods can distort the relationships between variables, which hinders replication and further exploration of the dataset
Synthetic datasets can substantially overcome these replicability issues: this method creates a new dataset that mimics an original dataset by preserving its statistical properties and the relationships between variables. These can be made using the 'synthpop' #Rstats package
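As a rough sketch of what this looks like in practice (the data frame and variable names below are placeholders, not objects from my paper), synthesising a dataset is essentially one function call:

    # install.packages("synthpop")          # once, if not already installed
    library(synthpop)

    # 'original_data' is a placeholder for a confidential data frame
    sds <- syn(original_data, seed = 2019)  # create a synthetic version (seed for reproducibility)
    head(sds$syn)                           # the synthetic data frame itself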
This means that you can run an analysis on the synthetic data and get almost exactly the same result as if you had run that same analysis on the original data, despite the fact that no record in the synthetic dataset represents a "real" individual.
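For example (a minimal sketch, continuing the placeholder objects above), you can fit the same regression to both versions and compare the estimates:

    # The model fitted to the original (confidential) data...
    fit_obs <- lm(score ~ age + sex, data = original_data)

    # ...and to the synthetic data, via synthpop's lm.synds() wrapper
    fit_syn <- lm.synds(score ~ age + sex, data = sds)

    summary(fit_obs)
    summary(fit_syn)   # estimates should closely match those from fit_obs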
It's important to assess how closely your synthetic dataset mimics your original data, in other words, the *utility* of your synthetic data. First, you assess general utility by comparing the distributions of your variables in the synthetic and original data
Here's an example from my paper with four variables. The counts are very similar between the observed and the synthetic data, so we can be confident that the synthetic data has good general utility
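In synthpop, this kind of general utility check can be done with compare(), which tabulates and plots each variable's distribution in the synthetic data against the observed data (again a sketch, with the placeholder objects and variable names from above):

    # Compare the distributions of selected variables: synthetic vs observed
    compare(sds, original_data, vars = c("score", "age", "sex"))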
Second, you need to examine the specific utility of your reported outcomes. In other words, are your model estimates similar in the synthetic and original data? If someone were to run your analysis script on the synthetic data, how close would the results be to those from the original?
To examine this, you can perform a lack-of-fit test for the overall model and compare standardised coefficient estimates. Figure B has a 99.94% CI overlap, so this synthetic model has done well. The other models perform quite well too, with large overlaps.
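If I recall the synthpop interface correctly, compare() applied to a model fitted with lm.synds() or glm.synds() reports exactly this (a sketch, continuing the placeholder example above):

    # Compare the synthetic-data model with the same model fitted to the
    # original data: coefficients side by side, the CI overlap for each
    # estimate, and a lack-of-fit test for the model as a whole
    compare(fit_syn, original_data)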
There are two conditions in which the risk of disclosure increases: if a synthetic data record matches an original data record, and if there's an extreme value from a single individual that can be linked to a person. Both can be checked using functions in synthpop
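synthpop has helpers for these checks, for instance replicated.uniques() to flag synthetic records that replicate unique real records, and sdc() for statistical disclosure control (a sketch with the placeholder objects above; the argument name is as I remember it, so double-check the package docs):

    # Flag synthetic records that are unique AND identical to a unique
    # record in the original data (a potential disclosure risk)
    replicated.uniques(sds, original_data)

    # Statistical disclosure control: e.g. drop those replicated uniques
    # from the synthetic data before sharing it
    sds_safe <- sdc(sds, original_data, rm.replicated.uniques = TRUE)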
One criticism of sharing data is that researchers would lose the first opportunity to analyse all of their own data (i.e., "lost" papers). With synthetic data, others would *have* to verify outcomes with the original authors, which helps address this concern while still allowing data to be shared
Many papers note that data are "available upon request", but such data are often difficult to retrieve due to unreachable authors or lost datasets journals.plos.org/plosone/articl… Synthetic data would increase reproducibility, as they can be shared alongside the paper even when there are privacy concerns
I've posted the #Rstats code with the paper, so you can follow along with my analysis. You can also download my example data & reproduce my figures osf.io/z524n/

I would love feedback on how I can improve this paper, or on any errors, before I submit