NEW PREPRINT 🎉 Synthetic datasets: A primer psyarxiv.com/dmfb3/

By sharing synthetic datasets that mimic original datasets that could not otherwise be made open, researchers can ensure the reproducibility of their results while maintaining participant privacy

[THREAD]
Openly accessible biomedical research data provides ENORMOUS utility for science and society. With open data, scholars can verify results, generate new knowledge, form new hypotheses, and reduce the unnecessary duplication of data collection.
Researchers who wish to share data while reducing the risk of disclosure have traditionally used data anonymization procedures to mask identities, in which explicit identifiers such as names, addresses, and national identity numbers are removed.
But despite these anonymization efforts, specific individuals can still be identified in anonymized datasets with high accuracy (given certain information), which we saw with this recent @NatureComms paper nature.com/articles/s4146…
To reduce these risks, researchers have also used data aggregation and the addition of random noise. But these methods can distort the relationships between variables, which hinders replication and further exploration of the dataset
Synthetic datasets can substantially overcome these replicability issues: this method creates a new dataset that mimics an original dataset by preserving its statistical properties and the relationships between variables. These can be made using the 'synthpop' #Rstats package
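As a rough sketch of what this looks like in practice (the data frame and variable names below are placeholders, not objects from my paper), synthesising a dataset is essentially one function call:

    # install.packages("synthpop")          # once, if not already installed
    library(synthpop)

    # 'original_data' is a placeholder for a confidential data frame
    sds <- syn(original_data, seed = 2019)  # create a synthetic version (seed for reproducibility)
    head(sds$syn)                           # the synthetic data frame itself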
This means that you can run an analysis on the synthetic data and get almost exactly the same result as if you had run that same analysis on the original data, despite the fact that no record in the synthetic dataset represents a "real" individual.
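For example (a minimal sketch, continuing the placeholder objects above), you can fit the same regression to both versions and compare the estimates:

    # The model fitted to the original (confidential) data...
    fit_obs <- lm(score ~ age + sex, data = original_data)

    # ...and to the synthetic data, via synthpop's lm.synds() wrapper
    fit_syn <- lm.synds(score ~ age + sex, data = sds)

    summary(fit_obs)
    summary(fit_syn)   # estimates should closely match those from fit_obs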
It's important to assess how closely your synthetic dataset mimics your original data, in other words, the *utility* of your synthetic data. First, you assess general utility by comparing the distributions of your variables in the synthetic and original data
Here's an example from my paper with four variables. The counts are very similar between the observed and the synthetic data, so we can be confident that the synthetic data has good general utility
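In synthpop, this kind of general utility check can be done with compare(), which tabulates and plots each variable's distribution in the synthetic data against the observed data (again a sketch, with the placeholder objects and variable names from above):

    # Compare the distributions of selected variables: synthetic vs observed
    compare(sds, original_data, vars = c("score", "age", "sex"))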
Second, you need to examine the specific utility of your reported outcomes. In other words, are your model estimates similar in the synthetic and original data? If someone were to run your analysis script on the synthetic data, how close would the results be to those from the original?
To examine this, you can perform a lack-of-fit test for the overall model and compare standardised coefficient estimates. Figure B has a 99.94% CI overlap, so this synthetic model has done well. The other models perform quite well too, with large overlaps.
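If I recall the synthpop interface correctly, compare() applied to a model fitted with lm.synds() or glm.synds() reports exactly this (a sketch, continuing the placeholder example above):

    # Compare the synthetic-data model with the same model fitted to the
    # original data: coefficients side by side, the CI overlap for each
    # estimate, and a lack-of-fit test for the model as a whole
    compare(fit_syn, original_data)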
There are two conditions in which the risk of disclosure increases: if a synthetic data record matches an original data record, and if there's an extreme value from a single individual that can be linked to a person. Both can be checked using functions in synthpop
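synthpop has helpers for these checks, for instance replicated.uniques() to flag synthetic records that replicate unique real records, and sdc() for statistical disclosure control (a sketch with the placeholder objects above; the argument name is as I remember it, so double-check the package docs):

    # Flag synthetic records that are unique AND identical to a unique
    # record in the original data (a potential disclosure risk)
    replicated.uniques(sds, original_data)

    # Statistical disclosure control: e.g. drop those replicated uniques
    # from the synthetic data before sharing it
    sds_safe <- sdc(sds, original_data, rm.replicated.uniques = TRUE)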
One criticism of sharing data is that researchers would lose the first opportunity to analyse all of their own data (i.e., "lost" papers). With synthetic data, others would *have* to verify outcomes with the original authors, which helps address this concern while still allowing data to be shared
Many papers note that data are "available upon request", but such data are often difficult to retrieve due to unreachable authors or lost datasets journals.plos.org/plosone/articl… Synthetic data would increase reproducibility, as they can be shared alongside the paper even when there are privacy concerns
I've posted the #Rstats code with the paper, so you can follow along with my analysis. You can also download my example data & reproduce my figures osf.io/z524n/

I would love feedback on how I can improve this paper, or on any errors, before I submit