What is data lineage and why is it important when building ML systems?
From @chipro’s new book, Designing Machine Learning Systems: 1/5
Data lineage is the process of keeping track of the origin of your data and tracking versions of it over time.
This matters when your data changes: you want to know which model was trained on which data, and how that affected model performance.
2/5
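To make the idea concrete, here is a minimal sketch of tying a model to the exact data it was trained on by hashing the data's contents. Everything in it (`fingerprint`, `record_training`, the in-memory `lineage` dict, the model names) is hypothetical and for illustration only; it is not from the book or from any library.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a short content hash identifying one version of a dataset."""
    return hashlib.sha256(data).hexdigest()[:12]


# Hypothetical in-memory lineage log: model name -> data version it was trained on.
lineage = {}


def record_training(model_name: str, data: bytes) -> str:
    """Record which data version a model was trained on and return that version."""
    version = fingerprint(data)
    lineage[model_name] = version
    return version


# Two models trained on two versions of the same (made-up) CSV.
v1 = record_training("model_a", b"x,y\n1,2\n")
v2 = record_training("model_b", b"x,y\n1,2\n3,4\n")
```

Because the version is derived from the bytes themselves, any change to the data yields a new version string, so a model can always be traced back to the data it saw.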
You could track data versions yourself, but it'll likely be as error-prone as "model_latest_latest_actual_latest_2021.pth" is when tracking models.
@weights_biases Artifacts is one way you can track the data you used to train your models with a few lines of code. 3/5
Here’s @charles_irl showing you how to use Artifacts:
The model in this paper learns to associate one or more objects with the effects they have on their environment (shadows, reflections, etc.), given a video and rough segmentation masks of each object. This enables video effects like "background replacement". 2/5
and "color pop" and a "stroboscopic" effect (in the next tweet): 3/5
Here's a little summary of the different parts for those curious: 1/5
The Dataset has to be passed to the DataLoader. It's where you transform your data and where the inputs and labels are stored.
It is basically one big list of (input, label) tuples. So when you index into it with dataset[i], it returns (input[i], label[i]).
2/5
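A rough pure-Python sketch of that idea, assuming a map-style dataset only needs `__len__` and `__getitem__`. `ToyDataset` and its `transform` argument are made up for illustration and do not depend on PyTorch.

```python
class ToyDataset:
    """Stores inputs and labels and applies a transform when an item is read."""

    def __init__(self, inputs, labels, transform=None):
        if len(inputs) != len(labels):
            raise ValueError("inputs and labels must have the same length")
        self.inputs = inputs
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        x = self.inputs[i]
        if self.transform is not None:
            x = self.transform(x)  # the transform is applied at access time
        return x, self.labels[i]


dataset = ToyDataset([1.0, 2.0, 3.0], ["a", "b", "c"], transform=lambda x: x * 10)
sample = dataset[1]  # the (input, label) tuple at index 1, with the transform applied
```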
The sampler and batch sampler are used to choose which inputs & labels go into the next batch.
Artistic license warning 👨‍🎨⚠️: They don't actually grab the inputs and labels, and the dataset doesn't actually deplete. They tell the DataLoader which indices to grab.
3/5
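Here is a toy sketch of that division of labour, written in plain Python rather than with PyTorch's own classes. `random_sampler`, `batch_sampler`, and `toy_loader` are hypothetical names: the samplers only produce indices, and the loader uses those indices to read from the dataset.

```python
import random


def random_sampler(n, seed=0):
    """Yield every index in range(n) exactly once, in shuffled order."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    yield from indices


def batch_sampler(sampler, batch_size, drop_last=False):
    """Group indices from the sampler into lists of at most batch_size."""
    batch = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch  # final, smaller batch


def toy_loader(dataset, batch_size, seed=0):
    """Look up the (input, label) tuples for each batch of indices."""
    for indices in batch_sampler(random_sampler(len(dataset), seed), batch_size):
        yield [dataset[i] for i in indices]


# A plain list of (input, label) tuples stands in for the dataset.
data = list(zip(range(5), "abcde"))
batches = list(toy_loader(data, batch_size=2))
```

The dataset is only ever read by index, so it is still intact after iterating and can be looped over again for the next epoch.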