Melissa Kline @melissaekline
Update: Psych-DS (do you like or hate this name? Please tell us!) has a repository! Do you care about reproducible analysis pipelines, and/or making scientific datasets discoverable by search engines/your favorite repository? We'd love to hear from you!

github.com/mekline/psych-…
If you would like to get involved, or if you'd like to hear from us when we release the finished version of the specification, let us know at this form:

goo.gl/forms/2dd6rouM…
By the way, I really mean it about getting involved. Please email me (mekline@mit.edu) if you are confused about anything at all. Although we are starting with a particular technical document, this is really a social/community project - we are trying to *build consensus* about
how we share datasets in the social sciences, which means we need to listen to each other & figure out what our common needs are. For individual scientists, the goal is to provide the tech infrastructure to take steps toward data standardization on their own. For organizations,
like repositories, search engines, libraries, large scientific organizations, and anyone else who works with datasets produced by scientists, the goal is to meet you halfway, by providing a layer of standardization that makes it easier to help scientists give you what you need.
Initial recommendation! If you really like software development or have strong opinions about JSON, read the specification draft. Otherwise, skip/skim it for now! In the future, there will be a step-by-step guide to format & move your files around so that you can do things like:
(1) Share everything about a private dataset (variables, structure, even all materials and analytic code) except for the data itself, and be findable by people who might want to use it. (Like this tweet or the following examples if you are someone who'd like to be able to do this!)
(2) Put your dataset online, and get indexed by Google Dataset Search (toolbox.google.com/datasetsearch)
(3) Standardize how your lab formats datasets, so you have to spend less time re-naming columns, re-importing data, and re-trying to remember what your columns refer to.
(4) Run a collaborative experiment with fellow scientists, and clearly tell them how their data should look when they send it to you.

The ManyBabies analysis team is feeling this one hard!
(4.cont) Everyone sent us something sensible, but wouldn't it have been great if we had an app for people to run their datasets through to check before submitting?
(5) Search each other's datasets to learn more about a paper you read, do a meta-analysis, or find datasets for secondary projects. Someone asked on Twitter a while ago for "Datasets that measure weight to 1/10 of a gram." Wouldn't it be cool if you could put that in a search bar?
OMG GRAMS ARE A UNIT OF MASS 😳😳😳 ⚖️ ⚖️ ⚖️
But to be clear - this specification won't involve deciding that the word 'gram' always refers to a unit of mass or to grammatical sentences, but it WILL make it possible to search for datasets with a variable named 'gram', and, we hope, provide the foundation for making those
... kinds of decisions as a field. This specification is closely based on the BIDS standard for fMRI data (bids.neuroimaging.io), and one of their great features is the ability to build a 'derivative', a more domain-specific flavor that takes BIDS as a starting point to reach
further consensus about how to write down and document particular datasets so we can make fewer mistakes and write analyses that are more transparent to one another.
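To make two of the examples above a little more concrete (checking a collaborator's file before they submit it, and searching datasets by variable name), here is a minimal Python sketch. It is NOT taken from the Psych-DS draft: the column names, the metadata records, and both helper functions are hypothetical. The only borrowed piece is the `variableMeasured` key, which is a property schema.org defines for datasets; whether the specification ends up using it is an assumption here.

```python
import csv
import io


def check_columns(csv_text, expected_columns):
    """Compare a CSV header against an agreed column list.

    Returns (missing, unexpected) lists of column names.
    Hypothetical helper, not part of any published tool.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    missing = [c for c in expected_columns if c not in header]
    unexpected = [c for c in header if c not in expected_columns]
    return missing, unexpected


def find_datasets_with_variable(metadata_records, variable_name):
    """Return the names of datasets whose metadata lists the variable.

    Assumes each record is a dict with a "name" and an optional
    "variableMeasured" list of strings (an assumption for this sketch).
    Matching is by exact name, ignoring case.
    """
    wanted = variable_name.lower()
    return [
        record["name"]
        for record in metadata_records
        if wanted in (v.lower() for v in record.get("variableMeasured", []))
    ]


# Made-up submission from a collaborating lab: one column is misnamed.
submission = "subject_id,age_months,looking_time\n01,7,4.2\n"
expected = ["subject_id", "age_months", "looking_time_s"]
missing, unexpected = check_columns(submission, expected)
print("missing:", missing)
print("unexpected:", unexpected)

# Made-up metadata records for two datasets.
records = [
    {"name": "infant_weights", "variableMeasured": ["subject_id", "gram"]},
    {"name": "sentence_ratings", "variableMeasured": ["item", "rating"]},
]
print(find_datasets_with_variable(records, "gram"))
```

Note that the search only matches the name 'gram'. It says nothing about whether that column holds a mass or a grammaticality judgment, which is exactly the kind of decision left to the field.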
A final important point - we're very aware that we're entering an ecosystem of organizations that have been working on data management (in various capacities) for a long time. This means (1) there are a bunch of other orgs. you should know about if you care about these topics!
And (2), we are hoping to avoid reinventing the wheel or re-making any common mistakes these groups have encountered. What follows is a (truly un-comprehensive) list of organizations with related goals:
The NIH Data Commons, a pilot project to create a centralized cloud repository for biomedical data:

commonfund.nih.gov/commons

@nih_dcppc
Inter-university Consortium for Political and Social Research, which provides guidance and training on data management, and manages a huge archive of (often huge) datasets @ICPSR
icpsr.umich.edu/icpsrweb/conte…
Scientific repositories including @figshare, @OSFramework and @dataverseorg, which provide online homes for datasets and other project materials.
Organizations like Project TIER (@Project_TIER) and software/data carpentries (@thecarpentries) that provide training on good data practices and using software to make our science more reproducible