Facundo Muñoz famuvie@oc.todon.fr
Biostatistician @Cirad. Always a student. #rstats. #Bayesian methods and Spatial(-temporal) applied #statistics. #OpenScience and #ReproducibleResearch.

May 3, 2021, 13 tweets

The evolution of the #RStats script (a thread).

1. You run an analysis with an R script.
You want to share it with a collaborator so she can explore and review the code, propose modifications, fixes, improvements, etc. You send the script by e-mail, along with the data.

Problem: The script is not portable. She needs to substitute some platform-specific packages and functions and modify all paths to the data and to various files. When she sends back her edits, you need to manually revert those changes so that the script works on your machine again.

2. The project directory.

You organise files in a directory with a standard structure (src, data, reports) and only use relative paths. You adopt UTF-8 encoding and cross-platform packages and functions.
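As a minimal sketch of this step, the snippet below builds a relative path with `file.path()` and reads and writes a data file with an explicit UTF-8 encoding, using base R only. The project name `my-project` and the file `measurements.csv` are hypothetical, and a temporary directory stands in for the project root so that the example runs anywhere.

```r
# Hypothetical project root; a temporary directory stands in for the real one.
project_root <- file.path(tempdir(), "my-project")

# Standard structure from the thread: src, data, reports.
for (d in c("src", "data", "reports")) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Work from the project root so that every path below can stay relative.
old_wd <- setwd(project_root)

# file.path() joins components with "/" on every platform.
data_path <- file.path("data", "measurements.csv")

# Hypothetical example data with a non-ASCII character in a site name.
example <- data.frame(
  site  = c("Montpellier", "C\u00f3rdoba"),
  value = c(1.5, 2.5)
)

# Write and read with an explicit encoding instead of the platform default.
write.csv(example, data_path, row.names = FALSE, fileEncoding = "UTF-8")
dat <- read.csv(data_path, fileEncoding = "UTF-8")

# Restore the previous working directory.
setwd(old_wd)
```

Because nothing in the script refers to an absolute location, a collaborator can run it unchanged after copying the whole directory.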
Your collaborator sends a document reporting and discussing the results.

Problem: The figures and tables that you get don't match those in the report. Besides, each time you change something, you need to manually update all the results in the report, or at least verify that they didn't change. It is easy to overlook or forget something.

3. Rmarkdown

You integrate the script and the report into a single source document by replacing all the results in the report with the R code that generates them. You now have several reports (descriptive, alternative models, and one for model comparison).

Problem: The document grows and things become difficult to find. Compiling is slow due to some time-consuming steps. Each day, you need to execute all chunks from the beginning. The various reports (especially the last one) use objects from the others that need to be recomputed.

4. The targets package
docs.ropensci.org/targets/
You use the #targets package to separate the computations from the reports. The package keeps track of the dependencies among objects, and you can retrieve any result from inside the Rmarkdown documents.
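A pipeline of this kind might look roughly like the following `_targets.R` file. This is only a sketch: the helper functions `read_data()` and `fit_model()`, the target names and the path `data/measurements.csv` are hypothetical and do not come from the thread. It requires the targets package to be installed.

```r
# _targets.R (pipeline definition at the project root)
library(targets)

# Hypothetical helper functions; in a real project they might live under src/.
read_data <- function(path) read.csv(path, fileEncoding = "UTF-8")
fit_model <- function(dat) lm(y ~ x, data = dat)

# The pipeline is a list of targets. Dependencies are detected from the
# symbols used in each command: model depends on raw_data, which in turn
# depends on data_file.
list(
  # format = "file" tracks the contents of the file at the returned path.
  tar_target(data_file, "data/measurements.csv", format = "file"),
  tar_target(raw_data, read_data(data_file)),
  tar_target(model, fit_model(raw_data))
)
```

Running `targets::tar_make()` then rebuilds only the targets that are out of date, and a report can fetch a stored result with `targets::tar_read(model)` instead of recomputing it.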

Problem: Integrating changes from various collaborators and keeping track of modifications.

5. The versioned project in a git repository.

You initialise a local git repository and push your changes into a remote repository accessible for your collaborators.

Problem: Discussing results with other collaborators in the project who neither need nor want to set up all the infrastructure. You e-mail megabytes of attachments each time the reports are updated. Also, there is sometimes confusion about which version of the reports is the latest.

6. Publishing reports online using Continuous Integration and Git(La|Hu)b pages.

Now you can work simultaneously on the same or different documents, integrate everything automagically and make sure the reports are up to date and online with the push of a button.

Problem: The project is finalised and you want to share it more widely. Even with full access to the code, it is not trivial for most people to set up a suitable environment. Installing R and the necessary packages at the same or compatible versions may not even be desirable.

7. Docker

You distribute a #docker image containing an operating system, the appropriate version of R and R packages and your full repository, including cached targets objects.

The nirvana of #reproducibility.
