If you're a #dataScientist working in #rstats, you've probably heard of #docker, but might not know why you'd care or how to get started...
Here's a thread summarizing my #useR2022 talk from a few weeks ago, plus a link to my free online book📕!
🧵
Docker is a general tool for packaging code with its dependencies.
So in the data science world, Docker is a tool for reproducibility + portability. It can make it easy to share your work with others, or to keep it safe for later.
When you think about reproducing an R project, there are layers of reproducibility: the more reproducible you make things, the more work it takes.
The top layers of the reproducibility stack -- code, data, and R packages -- have existing tooling: #git for code, {renv} for package libraries.
And who knows what for "reproducing" data. I don't actually have an answer on that one. It really depends on your data.
But those middle layers -- the R version, system libraries, and the operating system -- are where Docker is a ⭐STAR⭐.
You can create a container image with a simple Dockerfile, and then have the environment up and running in just moments (once you've downloaded the image).
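For instance, a minimal Dockerfile for an R project might look something like this -- the base image tag, system libraries, and script name here are just placeholder assumptions, not a one-size-fits-all recipe:

```dockerfile
# Pin the R version via a versioned rocker base image (example tag)
FROM rocker/r-ver:4.2.1

# System libraries that some R packages need (example: curl/openssl headers)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy the project in and restore its package library with {renv}
WORKDIR /project
COPY . .
RUN Rscript -e 'install.packages("renv"); renv::restore()'

# Run the analysis script (hypothetical name) when the container starts
CMD ["Rscript", "analysis.R"]
```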
If you've never dealt with Docker before, here's a simple model of the states of containers and images.
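Roughly: an image is the frozen, shareable recipe; a container is a running (or stopped) instance built from it. A quick sketch of the lifecycle with the docker CLI (the image/container names are made up):

```sh
docker build -t my-analysis .            # Dockerfile -> image
docker run -d --name run1 my-analysis    # image      -> running container
docker ps -a                             # list containers, running and stopped
docker stop run1                         # running    -> stopped
docker start run1                        # stopped    -> running again
docker rm -f run1                        # remove the container
docker rmi my-analysis                   # remove the image
```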
Alright, you're on board, but how?
Check out my book DevOps 4 Data Science. It's currently in draft form, but the Docker chapter is reasonably complete. It'll be out in print...sometime...but there'll always be a free online copy. Enjoy!