It should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.
If you are at the beginning of your ML journey, you might be:

1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside your Notebook and do so manually at each retraining.
If you are beyond Notebooks, you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator triggers.
In either case, the ML Pipeline will not differ much, and a well-designed System should track at least:
2️⃣ Datasets used for Training Machine Learning Models in Experimentation or Production ML Pipelines. Track your Train/Test splits here as well, and save all important Dataset-level metrics - Feature Distributions etc.
3️⃣ Model Parameters (e.g. model type, hyperparameters) together with Model Performance metrics.
4️⃣ Model Artifact Location.

5️⃣ The Machine Learning Pipeline is an artifact itself - track who triggered it and when, the Pipeline ID etc.
✅ Code: everything is code - version and track it.
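As an illustration of what such a run record could contain, here is a minimal sketch in plain Python. All names are hypothetical, not tied to any specific tracking tool; the dataset fingerprint stands in for proper dataset versioning.

```python
import hashlib
import json
import time
import uuid

def dataset_fingerprint(rows):
    """Hash the serialized dataset so a run is tied to the exact data used."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def build_run_record(train_rows, test_rows, params, metrics,
                     artifact_uri, code_version, triggered_by):
    """Collect everything a Tracking System should store for one pipeline run."""
    return {
        "run_id": str(uuid.uuid4()),                       # pipeline run identity
        "triggered_by": triggered_by,                      # who triggered it
        "triggered_at": time.time(),                       # when it was triggered
        "train_dataset": dataset_fingerprint(train_rows),  # datasets + split
        "test_dataset": dataset_fingerprint(test_rows),
        "params": params,                                  # model type, hyperparameters
        "metrics": metrics,                                # performance metrics
        "artifact_uri": artifact_uri,                      # model artifact location
        "code_version": code_version,                      # code version (e.g. git SHA)
    }

record = build_run_record(
    train_rows=[{"x": 1, "y": 0}],
    test_rows=[{"x": 2, "y": 1}],
    params={"model_type": "logistic_regression", "lr": 0.1},
    metrics={"accuracy": 0.91},
    artifact_uri="s3://models/churn/42",
    code_version="3f2a9c1",
    triggered_by="airflow",
)
```

A real system would store this record in a database keyed by `run_id`, so any Model Artifact can be traced back to its data, parameters and code.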
When a Trained Model Artifact is saved to a Model Registry, there should always be a 1:1 mapping between the previously saved Model Metadata and the Artifact written to the Model Registry:
➡️ The Model Registry should have a convenient user interface in which you can compare metrics across different Experiment versions.

➡️ The Model Registry should allow changing Model State with a single click of a button - usually a transition between Staging and Production.
Finally:
6️⃣ The Model Tracking System should be integrated with the Model Deployment System. Once a model's state is changed to Production, the Deployment Pipeline is triggered - the new model version is deployed, the old one decommissioned.
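The state-change-triggers-deployment idea from 6️⃣ can be sketched as a small state machine. This is an illustrative toy, not the API of any real registry; the valid transitions and the deploy hook are assumptions.

```python
# Allowed Model State transitions (assumed for this sketch).
VALID_TRANSITIONS = {
    ("None", "Staging"),
    ("Staging", "Production"),
    ("Production", "Archived"),
}

class ModelRegistry:
    def __init__(self, deploy_hook):
        self.versions = {}            # model version -> current state
        self.deploy_hook = deploy_hook  # called when a version goes to Production

    def register(self, version):
        self.versions[version] = "None"

    def set_state(self, version, new_state):
        old_state = self.versions[version]
        if (old_state, new_state) not in VALID_TRANSITIONS:
            raise ValueError(f"Illegal transition {old_state} -> {new_state}")
        self.versions[version] = new_state
        if new_state == "Production":
            # Decommission any previously live version, then deploy the new one.
            for v, state in self.versions.items():
                if state == "Production" and v != version:
                    self.versions[v] = "Archived"
            self.deploy_hook(version)

deployed = []
registry = ModelRegistry(deploy_hook=deployed.append)
registry.register("v1")
registry.set_state("v1", "Staging")
registry.set_state("v1", "Production")   # triggers the deployment hook
```

In a real setup, `deploy_hook` would kick off the Deployment Pipeline (e.g. via a CI/CD or orchestrator trigger) rather than append to a list.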
A Model Tracking System with these properties helps in the following ways:
➡️ You will be able to understand how a Model was built and repeat the experiment.
➡️ You will be able to share experiments with other experts involved.
➡️ You will be able to perform rapid and controlled experiments.
➡️ The system will allow safe rollbacks to any Model Version.
➡️ Such a Self-Service System would remove friction between ML and Operations experts.
Join a growing community of 3000+ Data Enthusiasts by subscribing to my Newsletter: swirlai.substack.com/p/sai-10-airfl…
1️⃣ "Fundamentals of Data Engineering" - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.

2️⃣ "Accelerate" - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices in and out.
A Feature Store System sits between Data Engineering and Machine Learning Pipelines and solves the following issues:
➡️ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5️⃣).
➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, make them discoverable through the Feature Catalog, and then serve Feature Sets for training and inference purposes through a unified interface (4️⃣, 3️⃣).
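The define-once / serve-for-both idea can be sketched as follows. The class and method names are illustrative only, not a real Feature Store API:

```python
# Minimal sketch: one registered transformation feeds both the offline
# (training) and online (serving) paths, so their values cannot drift.
class FeatureStore:
    def __init__(self):
        self.transformations = {}   # feature name -> transformation fn (the Catalog)
        self.online_store = {}      # entity id -> latest feature values

    def register_feature(self, name, fn):
        """Define the Feature Transformation once; it becomes discoverable."""
        self.transformations[name] = fn

    def materialize(self, entity_id, raw_record):
        """Compute features and sync them into the Online Store."""
        features = {name: fn(raw_record)
                    for name, fn in self.transformations.items()}
        self.online_store[entity_id] = features
        return features             # the same values go to the offline/training path

    def get_online_features(self, entity_id):
        return self.online_store[entity_id]

store = FeatureStore()
store.register_feature("amount_usd_cents", lambda r: int(r["amount_usd"] * 100))

training_row = store.materialize("user_1", {"amount_usd": 12.5})
serving_row = store.get_online_features("user_1")
assert training_row == serving_row   # no Training/Serving skew
```

Because both paths run the single registered transformation, there is no second, hand-maintained serving implementation to drift out of sync - which is exactly how the skew is eliminated.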
Change Data Capture (CDC) is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
➡️ Database Replication (refer to 3️⃣ in the Diagram).
👉 CDC can be used for moving transactions performed against a Source Database to a Target DB. If each transaction is replicated, it is possible to retain all ACID guarantees when performing replication.
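The transaction-by-transaction replication above can be sketched like this. It is a toy model of log-based CDC (everything here is illustrative): each committed transaction is appended to an ordered change log, and the target replays unseen transactions in commit order.

```python
# Toy log-based CDC: the source records committed transactions in order,
# the target replays them in the same order, transaction by transaction.
class SourceDB:
    def __init__(self):
        self.table = {}
        self.change_log = []   # ordered list of committed transactions

    def commit(self, operations):
        """Apply a transaction atomically and record it in the change log."""
        for op, key, value in operations:
            if op == "upsert":
                self.table[key] = value
            elif op == "delete":
                self.table.pop(key, None)
        self.change_log.append(operations)

class TargetDB:
    def __init__(self):
        self.table = {}
        self.applied = 0       # position in the source change log

    def replicate_from(self, source):
        """Replay transactions not yet applied, preserving commit order."""
        for operations in source.change_log[self.applied:]:
            for op, key, value in operations:
                if op == "upsert":
                    self.table[key] = value
                elif op == "delete":
                    self.table.pop(key, None)
            self.applied += 1

source, target = SourceDB(), TargetDB()
source.commit([("upsert", "acct:1", 100)])
source.commit([("upsert", "acct:1", 80), ("upsert", "acct:2", 20)])
target.replicate_from(source)
assert target.table == source.table
```

Because whole transactions are replayed in commit order (never half a transaction), the target always reflects a state the source actually passed through - which is what preserves the ACID guarantees across replication.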