Let's talk about setting up our Python/CUDA environment!
Our goals:
- Easily specify exact Python and CUDA versions
- Humans should not be responsible for finding mutually-compatible package versions
- Production and dev requirements should be separate
1/N
Here's a good way to achieve these goals:
- Use `conda` to install Python/CUDA as specified in `environment.yml`
- Use `pip-tools` to lock in mutually compatible versions from `requirements/prod.in` and `requirements/dev.in`
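As a sketch of what that layout might look like (the project name and pinned versions here are illustrative, not prescriptive):

```yaml
# environment.yml — conda owns Python + CUDA; everything else goes through pip-tools
name: my-project        # illustrative name
channels:
  - defaults
dependencies:
  - python=3.10         # pin the exact Python you want
  - cudatoolkit=11.8    # pin the exact CUDA toolkit you want
  - pip                 # pip itself, so pip-tools can manage the rest
```

With `pip-tools` installed, `pip-compile requirements/prod.in` resolves the loose requirements in `prod.in` into a fully pinned `requirements/prod.txt`, and the same for `dev.in` — so no human ever has to find mutually-compatible versions by hand.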
Dagster describes itself as a "data orchestrator for machine learning, analytics, and ETL"
Let's break that down 👇
2/ When you work with real-world data, your pipelines can get complex.
E.g., to train a language model on Twitter data, you might:
- Download data
- Strip out offensive tweets
- Preprocess the data
- Fit models
- Summarize training performance
- Deploy the best model to production
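As a hypothetical sketch, the steps above chain together like a plain Python pipeline (all function names and data here are illustrative stand-ins, not Dagster's actual API — an orchestrator's job is to manage exactly this kind of dependency chain for you):

```python
# Illustrative stand-ins for each pipeline step; a real pipeline would
# hit Twitter's API, run a toxicity classifier, train real models, etc.

def download_data():
    # Pretend these are raw tweets
    return ["great model!", "offensive tweet", "hello world"]

def strip_offensive(tweets):
    # Toy filter; a real step would use a trained classifier
    return [t for t in tweets if "offensive" not in t]

def preprocess(tweets):
    # Lowercase + whitespace tokenization
    return [t.lower().split() for t in tweets]

def fit_models(examples):
    # Pretend we trained two candidates and got validation scores
    return {"model_a": 0.8, "model_b": 0.9}

def summarize(scores):
    # Pick the best-scoring candidate
    best = max(scores, key=scores.get)
    return best, scores[best]

def deploy(model_name):
    return f"deployed {model_name}"

# Run the whole chain end to end
raw = download_data()
clean = strip_offensive(raw)
examples = preprocess(clean)
scores = fit_models(examples)
best, _ = summarize(scores)
print(deploy(best))  # -> deployed model_b
```

Each step consumes the previous step's output — which is exactly the dependency graph an orchestrator like Dagster tracks, schedules, and re-runs for you.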
3/ In production settings, pipelines can be even more complicated.
All well and good, but running those steps manually every time you update your model is painful, resource-intensive, and hard to scale.
And what happens if you have hundreds of these pipelines you need to manage?