If you're a #dataScientist working in #rstats, you've probably heard of #docker, but might not know why you'd care or how to get started...
Here's a thread summarizing my #useR2022 talk from a few weeks ago, plus a link to my free online book📕!
🧵
Docker is a general tool for packaging code with its dependencies.
So in the data science world, Docker is a tool for reproducibility + portability. It can make it easy to share your work with others, or to keep it safe for later.
When you think about reproducing an R project, there are layers of reproducibility: the more reproducible you make things, the more work it takes.
The top layers of the reproducibility stack -- code, data, and R packages -- have existing tooling: #git for code, {renv} for package libraries.
And who knows what for "reproducing" data. I don't actually have an answer on that one. It really depends on your data.
But those middle layers -- the R version, system libraries, and the operating system -- are where Docker is a ⭐STAR⭐.
You can create a container image with a simple Dockerfile, and then have the environment up and running in just moments (once you've downloaded the image).
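For instance, a minimal Dockerfile for an R project might look something like this -- the base image tag, system libraries, and script name here are just placeholder assumptions, not a one-size-fits-all recipe:

```dockerfile
# Pin the R version via a versioned rocker base image (example tag)
FROM rocker/r-ver:4.2.1

# System libraries that some R packages need (example: curl/openssl headers)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy the project in and restore its package library with {renv}
WORKDIR /project
COPY . .
RUN Rscript -e 'install.packages("renv"); renv::restore()'

# Run the analysis script (hypothetical name) when the container starts
CMD ["Rscript", "analysis.R"]
```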
If you've never dealt with Docker before, here's a simple model of the states of containers and images.
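Roughly: an image is the frozen, shareable recipe; a container is a running (or stopped) instance built from it. A quick sketch of the lifecycle with the docker CLI (the image/container names are made up):

```sh
docker build -t my-analysis .            # Dockerfile -> image
docker run -d --name run1 my-analysis    # image      -> running container
docker ps -a                             # list containers, running and stopped
docker stop run1                         # running    -> stopped
docker start run1                        # stopped    -> running again
docker rm -f run1                        # remove the container
docker rmi my-analysis                   # remove the image
```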
Alright, you're on board, but how?
Check out my book DevOps 4 Data Science. It's currently in draft form, but the Docker chapter is reasonably complete. It'll be out in print...sometime...but there'll always be a free online copy. Enjoy!