Let's talk about setting up our Python/CUDA environment!
Our goals:
- Easily specify exact Python and CUDA versions
- Humans should not be responsible for finding mutually compatible package versions
- Production and dev requirements should be separate
1/N
Here's a good way to achieve these goals:
- Use `conda` to install Python/CUDA as specified in `environment.yml`
- Use `pip-tools` to lock in mutually compatible versions from `requirements/prod.in` and `requirements/dev.in`
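A rough sketch of what `environment.yml` could look like. The environment name and the version numbers are placeholders, not a recommendation; pick the ones your project needs.

```yaml
# environment.yml (illustrative; name and versions are assumptions)
name: my-project
channels:
  - defaults
dependencies:
  - python=3.8        # exact Python version
  - cudatoolkit=10.2  # exact CUDA toolkit version
  - pip
  - pip:
      - pip-tools     # used afterwards to lock the pip requirements
```

From there, `conda env create -f environment.yml` builds the environment, `pip-compile requirements/prod.in` and `pip-compile requirements/dev.in` write pinned `.txt` files next to the `.in` files, and `pip-sync requirements/prod.txt requirements/dev.txt` installs exactly those pins. One way to keep dev consistent with prod is to start `requirements/dev.in` with the constraint line `-c prod.txt`.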
Dagster describes itself as a "data orchestrator for machine learning, analytics, and ETL"
Let's break that down 👇
2/ When you work with real-world data, your pipelines can get complex.
E.g., to train a language model on Twitter, you might:
- Download data
- Strip out offensive tweets
- Preprocess the data
- Fit models
- Summarize training performance
- Deploy the best model to production
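To make the shape of such a pipeline concrete, here is a minimal sketch of those steps as plain Python functions run in order. This is not Dagster's API: every function name, the toy tweets, and the made-up scoring rule are assumptions chosen only so the example runs offline.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def download_data(state: State) -> State:
    # Stand-in for a real download: a tiny hard-coded sample.
    return {**state, "tweets": ["Great day!", "some OFFENSIVE text", "Hello world"]}


def strip_offensive(state: State) -> State:
    # Toy filter: drop any tweet containing the word "offensive".
    kept = [t for t in state["tweets"] if "offensive" not in t.lower()]
    return {**state, "tweets": kept}


def preprocess(state: State) -> State:
    # Toy preprocessing: lowercase and trim whitespace.
    return {**state, "tweets": [t.lower().strip() for t in state["tweets"]]}


def fit_models(state: State) -> State:
    # Toy "training": score two hypothetical models with a made-up rule.
    n = len(state["tweets"])
    return {**state, "scores": {"small": 0.1 * n, "large": 0.2 * n}}


def summarize(state: State) -> State:
    # Pick the model with the highest score.
    best = max(state["scores"], key=state["scores"].get)
    return {**state, "best_model": best}


def deploy(state: State) -> State:
    # Stand-in for deployment: just record which model was chosen.
    return {**state, "deployed": state["best_model"]}


PIPELINE: List[Callable[[State], State]] = [
    download_data,
    strip_offensive,
    preprocess,
    fit_models,
    summarize,
    deploy,
]


def run_pipeline(steps: List[Callable[[State], State]]) -> State:
    # Run each step in order, passing the accumulated state along.
    state: State = {}
    for step in steps:
        state = step(state)
    return state


result = run_pipeline(PIPELINE)
```

Each step only depends on what the previous ones put into `state`, so the list order is the whole dependency story here; real pipelines usually need something richer than a list.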
3/ In production settings, pipelines can be even more complicated.
All well and good, but doing those steps manually every time you update your model is painful, resource-intensive, and hard to scale.
And what happens if you have hundreds of these pipelines you need to manage?
@DeepnoteHQ is an epic Jupyter notebook alternative:
- Improved UX
- Real-time collaboration (editing and discussion)
- Direct connections to your data stores, including Postgres, S3, and BigQuery
- Effortless sharing of your running notebook
👇
One major con: Deepnote does not yet support GPU compute.
For data scientists who don't need to train deep learning models, Deepnote is a great tool to check out. It improves your developer experience and allows effortless sharing of your work with your teammates and manager.
While the Deepnote team is working on adding GPU support, there's another Jupyter-like cloud notebook you can use for deep learning: @GoogleColab.
If you use it, we recommend signing up for their $10/month Pro plan for priority access to TPUs, longer runtimes, and more RAM.
Want to follow along as we post lectures publicly?👇
2/ Sign up to receive updates on our lectures as they're released (and to optionally participate in a synchronous learning community): forms.gle/zqE2rjkfqex2AQ…
3/ We cover the full stack, from project management to MLOps:
- Formulating the problem and estimating cost
- Managing, labeling, and processing data
- Making the right HW and SW choices
- Troubleshooting and reproducing training
- Deploying the model at scale