At #SDSS2019, I am chairing a session on workflows with @TiffanyTimbers @mikelove @stephaniehicks at 3:45 in Regency C - it is tucked away a bit in the corridor between AB and D
#SDSS2019 @mikelove #DataScience = act of extracting value from data using reproducible processes (coding, access to data, data transformations, organizing data, code, documents, managing dependencies)
#SDSS2019 @mikelove DS Workflow =
- literate programming
- analysis scripts
- pipelines
- environment management tools
#SDSS2019 @mikelove Literate Programming (Donald Knuth) “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
#SDSS2019 @mikelove so how to get that stuff published in academic journals? Especially given that "The Scientific Paper is Obsolete" theatlantic.com/science/archiv…
#SDSS2019 @mikelove goes into the notebook that analyzed gravitational waves -- code chunks become kinda big with a bit of commenting within the chunk
#SDSS2019 @mikelove Ten simple rules: human language is not as precise as code; OTOH the code may be difficult to read -- notebooks are in between as they link code and prose.
#SDSS2019 @mikelove quotes @rpeng on reproducibility, repeats his warning that availability of the tools only means you can rerun somebody else's code, not that the results are accurate, scientifically valid, relevant, etc.
#SDSS2019 @mikelove ways to academically publish workflows
- #RMarkdown -> overleaf -> F1000Research
- Jupyter Notebook -> supplement along with the paper
- #CodeOcean capsule
- eLife (in development -- The First Reproducible Paper)
#SDSS2019 @mikelove talks about @F1000Research: Bioconductor is one of the channels; out of 52 articles published, 30 are workflows. library(BiocWorkflowTools) to help within R. The data must be within the R package that you submit, so they are accessed via system.file() == pain point
#SDSS2019 @mikelove think of workflow as a protocol paper; difficult to obtain reviewers and get reviews out of them; mutual benefits between journals organizing the review and Bioc in providing markdown, build, etc. compute support
#SDSS2019 @mikelove library(BiocWorkflowTools) provides an #RMarkdown template, rendering to overleaf and upload to F1000. Mike had ~48 hours turnaround on the initial submission.
#SDSS2019 @mikelove explains @F1000Research process: open review; approved / approved with reservations / not approved; 2 "approves" to get indexed; history of the review and publication is available as the submission metadata
#SDSS2019 @mikelove the code is run by Bioc as a part of nightly check
#SDSS2019 @mikelove shows some submission statistics #ggplot -- more reviewers helps; more pages doesn't
#SDSS2019 @jo_hardin47 asks why overleaf -- @mikelove it is required by F1000
#SDSS2019 @TiffanyTimbers complex projects: multiple people, multiple files, convoluted package dependencies -> an interesting result that you cannot recreate; only you have access to the email; rerunning any small thing takes hours; there's code that runs on only one machine, and you don't know why
#SDSS2019 @TiffanyTimbers workflow features: version control, exec analysis scripts and pipelines, defined and shippable dependencies -- owing to Hillary Parker's opinionated analysis development peerj.com/preprints/3210/
#SDSS2019 @TiffanyTimbers go back in time via commits -- commit your results, too!
define "releases" as important checkpoints in time of your project
#SDSS2019 @TiffanyTimbers thousand emails in your inbox about the project -> use GitHub issues to communicate about the project (GitHub can notify you by email if not git-savvy); you don't lose emails / access to emails; close issue \approx archive email
#SDSS2019 @TiffanyTimbers executable analysis scripts and pipelines: when you have way too much code, you need to convert your huge chunks to scripts performing well defined tasks
#SDSS2019 @TiffanyTimbers activation time to becoming productive after a break -> record the order of scripts in the master file
- a shell script
- makefile
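A minimal sketch of such a master file, written in Python rather than shell or make. The script names are hypothetical placeholders, not from the talk; the point is only that the run order lives in one place and the run stops at the first failure.

```python
import subprocess
import sys

# Hypothetical script names, listed in the order they must run.
PIPELINE = [
    "01_download_data.py",
    "02_clean_data.py",
    "03_fit_model.py",
    "04_make_figures.py",
]


def run_pipeline(scripts, runner=subprocess.run):
    """Run each script in order; stop at the first failure.

    Returns the list of scripts that completed successfully.
    """
    done = []
    for script in scripts:
        # Use the same Python interpreter that runs this master file.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}")
            break
        done.append(script)
    return done
```

Calling run_pipeline(PIPELINE) after a long break reruns everything in the recorded order; the runner argument exists only so the function can be exercised without real scripts.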
#SDSS2019 @TiffanyTimbers annoyingly large data that takes hours to rerun -> use smart dependency tool to only rerun the affected parts
- GNU make
- snakemake (for Python)
- Nextflow
(SK: library(drake) )
(SK: #Stata -project-)
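A rough sketch of the idea behind these tools, assuming the simplest possible rule: a target is rerun only if it is missing or older than one of its inputs. Real tools such as GNU make do much more (dependency graphs, parallelism, pattern rules); the function names here are made up for illustration.

```python
import os


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def rebuild(rules, build):
    """Rebuild only the stale targets.

    rules: list of (target, sources) pairs, ordered so that each target
           comes after the targets it depends on.
    build: callable that regenerates one target from its sources.
    Returns the list of targets that were rebuilt.
    """
    rebuilt = []
    for target, sources in rules:
        if is_stale(target, sources):
            build(target, sources)
            rebuilt.append(target)
    return rebuilt
```

With rules such as clean data depending on raw data, and a figure depending on clean data, touching only the clean data reruns only the figure step.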
#SDSS2019 @TiffanyTimbers democratizing #datascience
1. anybody can recreate your analysis
2. one can better see how the bits of your analysis relate to one another
#SDSS2019 @TiffanyTimbers defined and shippable dependencies: programming languages; external packages used; other tools (e.g. make); legacy code -- what it is, and what version it is
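One small way to record "what it is, and what version it is" from inside a Python analysis, sketched with the standard library only. The function name and the shape of the returned dict are assumptions for illustration; tools mentioned in the talk, such as containers, go much further than this.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict recording the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }
```

Saving the result of describe_environment([...]) next to the analysis output gives a later reader at least a list of what was installed when the results were produced.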
#SDSS2019 @TiffanyTimbers code only runs on one machine, you don't know why -> containers: lightweight virtual machines with specific settings and specific startup scripts/options -- most popular is @Docker
#SDSS2019 @TiffanyTimbers
1. Install Docker
2. clone github repo
3. run one single docker command
hub.docker.com/r/ttimbers/dat…
#SDSS2019 @TiffanyTimbers when to use what?
1. version control: ALWAYS
2. executable analysis/scripts: when you start hiding chunks in rmd/ipynb
3. shippable dependencies: remote computing, tricky dependencies
#SDSS2019 @tiffanytimbers 3a. also helpful when you work on AWS and don't want to pay for the archives between your iterations of the project
#SDSS2019 @jo_hardin47 so what if I don't have R, what would your docker do for me? @TiffanyTimbers it will download R on your computer; for AWS it will be way faster. (It actually installs an image that has R in it.)
#SDSS2019 I took the liberty of letting Will Landau give a two-minute spiel on library(drake)
#SDSS2019 @stephaniehicks Useful Tools for Teaching and Outreach
#SDSS2019 @stephaniehicks will focus on literate code documents, especially for teaching and outreach
#SDSS2019 @stephaniehicks the power of case study: use one's own reasoning to see how the principles (of law, in SH's references to the original concept of a case study as defined in the 1860s at Harvard Law School) apply to your case
#SDSS2019 @stephaniehicks good case study: based on real, contemporary situations; engages students, makes them make choices; links theory and practice; there should be a little bit of ambiguity even at the end of the case study analysis
#SDSS2019 @stephaniehicks
1. Teach statistical thinking.
2. Focus on conceptual understanding.
3. Integrate real data with context and purpose.
4. Foster active learning
5. Use technology to explore concepts, analyze data.
6. Use assessments to improve, evaluate student learning
#SDSS2019 @stephaniehicks case studies in data science -- Nolan and Speed 1999 tandfonline.com/doi/abs/10.108…
#SDSS2019 @stephaniehicks relation between health care spending and coverage -> find the data -> plot them -> quantify relations -> BTW this is called "linear regression"
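The "quantify relations" step can be sketched as an ordinary least squares fit of a straight line. The numbers below are made up purely for illustration and are not real spending or coverage data; the variable names are likewise hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up numbers for illustration only, lying exactly on y = 1 + 2x.
spending = [1.0, 2.0, 3.0, 4.0]
coverage = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(spending, coverage)
```

Because the toy points lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2; real data would scatter around the fitted line.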
#SDSS2019 @stephaniehicks Nolan and Speed 1999 call for "substantial exercises with nontrivial solutions that leave room for different analyses". That will help students IRL where there may not be a single solution
#SDSS2019 @stephaniehicks steps in case study: 1. intro 2. data 3. background 4. investigations 5. theory
#SDSS2019 @stephaniehicks structure of a case study: the final plot first to motivate; what is the data, import, wrangling, exploratory data set (you can jump in to that section in your lecture), data analysis, summary of results
#SDSS2019 @stephaniehicks Git and GitHub workflows in the classroom -- start with happygitwithr.com by @JennyBryan
#SDSS2019 @stephaniehicks Infrastructure and Tools for Teaching Computing Throughout the Statistical Curriculum tandfonline.com/doi/abs/10.108… @minebocek
arxiv.org/abs/1811.02021 final part: incorporating @SlackHQ into the classroom -- reminder of yesterday's quote of "Email and IM having a baby -> Slack"
#SDSS2019 @stephaniehicks also gave a shoutout to @rudeboybert's presentation (Bert, please post a link here)
#SDSS2019 @stephaniehicks Slack allows students to answer each other's questions; @jhubiostat also uses Slack internally to communicate between faculty, students, alumni -- channels on food, seminars, etc.
#SDSS2019 @stephaniehicks Slack for outreach -- @RLadiesBmore chapter of @RLadiesGlobal -- Stephanie shows some Xmas trees drawn in R
#SDSS2019 @stephaniehicks promotion of The Corresponding Author @CorrespondAuth podcast
Subscribe to Stas Kolenikov