Thread by @jeroenhjanssens on Thread Reader App

🎉 The second edition of Data Science at the Command Line is out! 🎉

You can read the entire book for free at datascienceatthecommandline.com

It took a good year to rewrite and expand the first edition, so I'd like to say a few things. 🧵

First of all, many thanks to @OReillyMedia for allowing me to make the book available for free under a CC BY-ND license. You can, of course, also buy a physical copy from your favorite bookstore.

I'm grateful for all the help I've received, most notably from my editors @JessHaberman, @GreyEditing, and Kate Galloway, and my reviewers Aaditya Maruthi, @bde, @beeonaposy, @juliasilge, @mikedewar, and @reustle. The acknowledgements list many others, without whom I couldn't have pulled this off.

If you're only going to read two pages of this book, let it be those that @timoreilly wrote. Tim, thank you for writing such an inspiring foreword.

Writing this book triggered a decent amount of imposter syndrome, so I'm very happy with all the positive reactions so far, including the generous praise by @dancow, @chrishwiggins, @JohnDCook, @DynamicWebPaige, @jaredlander, and @jakehofman.

Let me give you a rundown of the actual content!

Chapter 1 discusses the OSEMN model for data science by @hmason and @chrishwiggins and explains why I believe the command line can be helpful here.

Chapter 2 covers how to get started with the Docker image (which contains over 💯 tools!) and introduces some essential Unix concepts such as working with redirection and getting help.

Chapter 3 shows several ways of obtaining data, the first step of the OSEMN model. APIs, databases, Excel sheets, web pages: nothing is safe from the command line!

Chapter 4 explains how you can create your own command-line tools using #bash, #python, or #rstats.

True story: In 2014 I competed in the US beatboxing championship. My stage name was shebang (after #!). I thought it sounded cool until the MC introduced me as "she bang". 🥲🎤
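For readers who have not met the shebang before, here is a minimal sketch of how a Python script can become a command-line tool. It is not taken from the book; the tool name `shout` and its behavior are hypothetical, chosen only for illustration.

```python
#!/usr/bin/env python3
# The line above is the shebang: it tells the shell which interpreter
# should run this file when it is executed directly.
"""Hypothetical tool `shout`: print its arguments in upper case."""
import sys


def shout(words):
    """Join the given words with spaces and convert them to upper case."""
    return " ".join(words).upper()


def main(argv):
    # argv[0] is the script name; everything after it is user input.
    print(shout(argv[1:]))


if __name__ == "__main__":
    main(sys.argv)
```

Assuming the file is saved as `shout` and made executable with `chmod +x shout`, running `./shout hello world` prints `HELLO WORLD`.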

Chapter 5 is all about scrubbing (aka cleaning) data. I discuss classic tools such as grep and awk, and newer tools such as jq and pup.

Chapter 6 introduces the essentials of make, a command-line tool to formalize your data workflow steps in terms of input and output dependencies. Confession: I often use make as a glorified task runner for my projects.

Chapter 7 talks about exploring data, mainly through visualization. I demonstrate rush, a tool that allows you to run R one-liners from the command line.

Chapter 8 demonstrates GNU parallel, a wonderful tool to parallelize and distribute your pipeline. Run cowsay on dozens of EC2 instances! If AWS is up, of course. 🤓

Chapter 9 covers modeling data, where I demonstrate how you can do dimensionality reduction, regression, and classification at the command line.

Chapter 10 is an entirely new chapter! It's about using multiple tools and programming languages together, including Jupyter, Python, RStudio, R, and Apache Spark.

Chapter 11 concludes the book with three pieces of advice and references to some excellent resources if you want to learn more.

And last but certainly not least, the appendix lists all the tools used in the book (and which are installed in the Docker container) together with citations and examples. Keep in mind that tools come and go. The command line itself, however, is here to stay.

While this book was challenging to write at times, I also had a lot of fun (I managed to sneak in a few Easter eggs and obscure jokes). In any case, I hope you'll find this book helpful. If you want to help me, please consider leaving a review on Amazon.

One more thing: I'm working on a brand new course! It's tentatively titled "Embracing the Command Line". First beta cohort expected to start in Q1 2022. You can learn more about this course and let me know what you think at datascienceatthecommandline.com/#course

If you can't wait to learn from me in person: I regularly give workshops about the command line, Python, R, etc. for a living (both on-site and online). In fact, it's the first edition from 2014 that eventually allowed me to start my own company, datascienceworkshops.com. Cheers!

The next two-day workshop "Data Science at the Command Line" will be on March 10-11, 2022 from 10am to 3pm EST.

Join me on Zoom for eight hours of interactive, hands-on sessions. Sign up via Gumroad.

datascienceworkshops.gumroad.com/l/data-science…

Data Science at the Command Line on Hacker News news.ycombinator.com/item?id=295893…
