@DVCorg is one of the fastest-growing ML experiment management tools.
The main idea of DVC is to *track ML experiments in Git*.
Everything is versioned -- the code, the data, the model, and the metrics created by your experiment. Pretty powerful!
The magic of DVC is that it supports datasets and models too large to store in GitHub.
And since every part of your experiment is versioned, you can easily roll back to an earlier run and reproduce it.
No more fiddling around to recreate that experiment from two weeks ago!
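For example, once a dataset is DVC-tracked, you can pull the exact version that any Git revision points to. Here's a minimal sketch using DVC's Python API — the repo URL, file path, and revision name are placeholders:

```python
import dvc.api

# Read the exact version of a DVC-tracked file that a given Git revision points to.
# The repo URL, path, and rev are placeholders -- swap in your own.
with dvc.api.open(
    "data/train.csv",                          # file tracked by DVC in the repo
    repo="https://github.com/you/your-repo",   # any Git repo with DVC metadata
    rev="experiment-from-two-weeks-ago",       # tag, branch, or commit hash
) as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 4:  # just peek at the first few rows
            break
```

The `rev` argument accepts any branch, tag, or commit hash, which is what makes rolling back to an old run cheap.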
What are the tradeoffs? (1/3)
*DVC does a lot*
Versioning data, experiment tracking, and running pipelines. You might prefer lighter-weight tools (e.g., replicate.ai) for any one of these.
What are the tradeoffs? (2/3)
*DVC imposes a workflow*
Each experiment is like a commit that you make by running your script through `dvc run`. Other tools like @weights_biases integrate into how you do things now
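Roughly, that looks like wrapping your training command in `dvc run` so DVC records the stage's code and data dependencies, outputs, and metrics. A hedged sketch written as a small Python launcher — the stage name and file names are illustrative, and the exact flags may differ with your DVC version:

```python
import subprocess

# Illustrative only: register a training stage with DVC so the code, data,
# model, and metrics for this run are tracked together.
subprocess.run(
    [
        "dvc", "run",
        "-n", "train",            # stage name (placeholder)
        "-d", "train.py",         # code dependency
        "-d", "data/train.csv",   # data dependency
        "-o", "model.pkl",        # tracked output
        "-m", "metrics.json",     # tracked metrics file
        "python", "train.py",     # the command DVC wraps and records
    ],
    check=True,
)
```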
What are the tradeoffs? (3/3)
*DVC versions your data, but the diffs are limited*
@pachyderminc focuses on versioning your entire data pipeline, and @DoltHub versions your dataset more granularly at the row level
For many, these are easy tradeoffs for reproducible ML experiments out of the box!
What other tools do you like for experiment tracking, reproducibility, and data versioning?
@DeepnoteHQ is an epic Jupyter notebook alternative:
- Improved UX
- Real-time collaboration (editing and discussion)
- Direct connections to your data stores, including Postgres, S3, and BigQuery
- Effortless sharing of your running notebook
👇
One major con: Deepnote does not yet support GPU compute.
For data scientists who don't need to train deep learning models, Deepnote is a great tool to check out. It improves your developer experience and allows effortless sharing of your work with your teammates and manager.
While the Deepnote team is working on adding GPU support, there's another Jupyter-like cloud notebook you can use for deep learning: @GoogleColab.
If you use it, we recommend signing up for their $10/month Pro plan for priority access to TPUs, longer runtimes, and more RAM.
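If you go the Colab route, it's worth confirming the runtime actually has an accelerator attached before kicking off a long training job. A minimal check, assuming a PyTorch setup:

```python
import torch

# Quick sanity check that the Colab runtime has a GPU attached.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU found -- switch the runtime type (Runtime > Change runtime type).")
```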
Want to follow along as we post lectures publicly?👇
2/ Sign up to receive updates on our lectures as they're released (and to optionally participate in a synchronous learning community): forms.gle/zqE2rjkfqex2AQ…
3/ We cover the full stack, from project management to MLOps:
- Formulating the problem and estimating cost
- Managing, labeling, and processing data
- Making the right HW and SW choices
- Troubleshooting and reproducing training
- Deploying the model at scale
1/ Last week's Production ML meetup featured Peter Gao and Princeton Kwong, former Engineering Managers at Cruise and Aquabyte. Below, their insights on data quality and its downstream effects for computer vision use cases:
2/ What is your experience with data quality?
- Cruise: (1) poorly-labeled data confuses the model; (2) models may perform poorly on edge case objects
3/ Data quality cont'd
- Aquabyte: No public datasets to work with. Our engineers went onsite to collect ground-truth data, built a huge labeling pipeline to get the data to human labelers, and designed our own labeling interface so labelers could label fish correctly.