🛠️ Tooling Tuesday 🛠️

This week: @dagsterio (dagster.io)

Dagster describes itself as a "data orchestrator for machine learning, analytics, and ETL".

Let's break that down 👇
2/ When you work with real-world data, your pipelines can get complex.

E.g., to train a language model on Twitter data, you might:
- Download data
- Strip out offensive tweets
- Preprocess the data
- Fit models
- Summarize training performance
- Deploy the best model to production
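A rough sketch of the steps above as plain Python functions chained by hand. Every function name, the blocklist, and the toy "models" are hypothetical stand-ins for illustration, not part of any real pipeline or library:

```python
# Hypothetical stand-ins for each pipeline step; real steps would call a
# data source, a training framework, a deployment system, and so on.
BLOCKLIST = {"badword"}  # assumed placeholder for an offensive-term filter


def download_data():
    return ["Hello world", "some badword here", "Dagster looks neat"]


def strip_offensive(tweets):
    # Drop any tweet that shares a token with the blocklist.
    return [t for t in tweets if not (set(t.lower().split()) & BLOCKLIST)]


def preprocess(tweets):
    return [t.lower().split() for t in tweets]


def fit_models(token_lists):
    # Toy "models": vocabularies capped at different sizes.
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {f"vocab_{k}": vocab[:k] for k in (2, 4)}


def summarize(models):
    # Toy "training performance": the size of each vocabulary.
    return {name: len(vocab) for name, vocab in models.items()}


def deploy_best(models, summary):
    best = max(summary, key=summary.get)
    return best, models[best]


def run_pipeline():
    tweets = strip_offensive(download_data())
    models = fit_models(preprocess(tweets))
    return deploy_best(models, summarize(models))
```

Calling `run_pipeline()` runs every step in order, every time. Nothing here handles scheduling, retries, or re-running a single failed step.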
3/ In production settings, pipelines can be even more complicated.

All well and good, but doing those steps manually every time you update your model is painful, resource-intensive, and hard to scale.

And what happens if you have hundreds of these pipelines you need to manage?
4/ Enter workflow engines.

Workflow engines help you define pipelines as directed acyclic graphs (DAGs) of tasks written in any language.

They allow you to run DAGs on a schedule (or in response to triggers), monitor their execution, reproduce them, and scale them.
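To make the DAG idea concrete, here is a minimal sketch using only Python's standard library (`graphlib`, Python 3.9+). The task names are illustrative and mirror the example pipeline; this is not how any particular workflow engine stores its graphs:

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
dag = {
    "download": set(),
    "strip_offensive": {"download"},
    "preprocess": {"strip_offensive"},
    "fit_models": {"preprocess"},
    "summarize": {"fit_models"},
    "deploy": {"fit_models", "summarize"},
}


def execution_order(graph):
    """Return one valid run order; graphlib raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
```

A workflow engine does the same dependency resolution, then adds scheduling, monitoring, and distributed execution on top.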
5/ @ApacheAirflow is the most popular workflow engine. It was developed at Airbnb in 2014-2015 and has since become a standard.
6/ Despite its popularity, Airflow is not universally loved.

Teams often complain that things like:

- Debugging
- Testing
- Fast local development
- Keeping track of produced data assets

are painful in Airflow.
7/ Airflow is a "dumb" workflow executor. It doesn't maintain state about executions, and doesn't understand data passed between steps.

This is an advantage because it keeps Airflow lightweight and general.

But @dagsterio thinks it is also the root cause of the usability complaints.
8/ Like Airflow, @dagsterio lets you implement tasks in any language and define dependencies using Python.

Unlike Airflow, Dagster:
- Types inputs and outputs
- Tracks data assets
- Separates configuration from code
- Produces execution metadata for better observability
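To illustrate what typed inputs and outputs, separate configuration, and execution metadata can buy you, here is a plain-Python sketch. It is NOT Dagster's API: `run_step` and `drop_short` are hypothetical names, and the type check only handles plain classes such as `list` or `dict`:

```python
import time
from typing import Any, Callable, Dict, get_type_hints


def run_step(fn: Callable[..., Any], config: Dict[str, Any], **inputs: Any) -> Dict[str, Any]:
    """Run one step, checking annotated types and recording simple metadata."""
    hints = get_type_hints(fn)
    # Validate each input against the step's annotation before running it.
    for name, value in inputs.items():
        expected = hints.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"{fn.__name__}: input {name!r} is not {expected.__name__}")
    start = time.perf_counter()
    output = fn(config=config, **inputs)
    # Validate the output against the return annotation.
    expected_out = hints.get("return")
    if expected_out is not None and not isinstance(output, expected_out):
        raise TypeError(f"{fn.__name__}: output is not {expected_out.__name__}")
    return {
        "step": fn.__name__,
        "output": output,
        "config": config,
        "seconds": time.perf_counter() - start,
    }


# The step's logic stays fixed; the threshold arrives through config.
def drop_short(tweets: list, config: dict) -> list:
    return [t for t in tweets if len(t) >= config["min_chars"]]


result = run_step(drop_short, config={"min_chars": 5}, tweets=["hi", "hello there"])
```

Passing a string instead of a list for `tweets` fails before the step runs, which is the kind of early error an orchestrator that understands the data flowing between steps can surface.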
9/ Dagster calls itself a "data orchestrator" rather than a "workflow engine" because of these opinionated choices, which aim to improve usability in the specific case where your workflow is a data pipeline.
10/ Dagster is also focused on making an easy-to-use local development experience with a nice GUI, and a seamless transition from local to remote execution.

Sounds like an upgrade! So what is the tradeoff?
11/ Dagster has a more structured and constrained programming model than Airflow or its other competitors.

There are more abstractions to learn when you are getting started. Usage is more rigid. E.g., some users say it's challenging to build graphs dynamically.
12/ If you're looking for a better way to build data pipelines, @dagsterio is worth a try!

Other workflow platforms worth checking out:
- github.com/spotify/luigi
- @kubeflow pipelines
- @PrefectIO
- @pachyderminc
13/ What else do you like about Dagster? Any other great tools for workflow execution that we missed?
