No Excuses Data Engineering Portfolio Template - next week I will enrich it with the missing Machine Learning and MLOps parts!
Today - let's review it once more. It is super helpful, as this kind of Data Architecture is what you will find in real-life situations.
Recap:
1. Data Producers - Python Applications that extract data from chosen Data Sources and push it to the Collector via REST or gRPC API calls.
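To make this concrete, here is a minimal Python sketch of what such a Data Producer could look like - the Collector URL and the payload shape are purely illustrative assumptions:

```python
import time

import requests  # pip install requests

# Hypothetical Collector endpoint - adjust to your own deployment.
COLLECTOR_URL = "http://localhost:8000/collect"


def extract_events():
    """Stand-in for a real Data Source: this could scrape an API,
    read files or tail an application log."""
    yield {"event_type": "page_view", "user_id": "u-123", "ts": time.time()}
    yield {"event_type": "click", "user_id": "u-456", "ts": time.time()}


if __name__ == "__main__":
    for event in extract_events():
        # Push each extracted event to the Collector over REST.
        response = requests.post(COLLECTOR_URL, json=event, timeout=5)
        response.raise_for_status()
```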
2. Collector - a REST or gRPC server written in Python that takes a payload, validates top-level field existence and correctness, adds additional metadata and pushes it into either the Raw Events Topic if the validation passes or a Dead Letter Queue if the top-level fields are invalid.
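A minimal sketch of the Collector, assuming FastAPI for the REST server and kafka-python for the producer side - the required fields and topic names are assumptions, not a fixed contract:

```python
import json
from datetime import datetime, timezone

from fastapi import FastAPI, Request  # pip install fastapi uvicorn
from kafka import KafkaProducer       # pip install kafka-python

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical top-level contract - only existence and type are checked here.
REQUIRED_FIELDS = {"event_type": str, "user_id": str, "ts": float}


@app.post("/collect")
async def collect(request: Request):
    payload = await request.json()
    valid = all(isinstance(payload.get(f), t) for f, t in REQUIRED_FIELDS.items())
    # Add Collector-side metadata before pushing downstream.
    payload["collector_ts"] = datetime.now(timezone.utc).isoformat()
    topic = "raw-events" if valid else "raw-events-dead-letter"
    producer.send(topic, payload)
    return {"status": "accepted", "routed_to": topic}
```

Run it with uvicorn and point the Producers at /collect.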
3. Enricher/Validator - a Python or Spark Application that validates the event schema in the Raw Events Topic, performs data enrichment and pushes the results into either the Enriched Events Topic if validation passes and enrichment succeeds, or a Dead Letter Queue if either of the previous steps failed.
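Here is how the Enricher/Validator could look in plain Python (no Spark), assuming jsonschema for schema validation and a call to the Enrichment API described in point 4 - schema, topic names and the API URL are illustrative:

```python
import json

import requests                                   # pip install requests
from jsonschema import ValidationError, validate  # pip install jsonschema
from kafka import KafkaConsumer, KafkaProducer    # pip install kafka-python

# Hypothetical full event schema - stricter than the Collector's top-level check.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"type": "string"},
        "user_id": {"type": "string"},
        "ts": {"type": "number"},
        "collector_ts": {"type": "string"},
    },
    "required": ["event_type", "user_id", "ts", "collector_ts"],
}

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    try:
        validate(instance=event, schema=EVENT_SCHEMA)
        # Call the Enrichment API (point 4) - here a hypothetical user lookup.
        enrichment = requests.get(
            "http://localhost:8001/enrich",
            params={"user_id": event["user_id"]},
            timeout=5,
        )
        enrichment.raise_for_status()
        event.update(enrichment.json())
        producer.send("enriched-events", event)
    except (ValidationError, requests.RequestException):
        producer.send("enriched-events-dead-letter", event)
```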
4. Enrichment API - an API of any flavor implemented in Python that can be called for enrichment purposes by the Enricher/Validator. This could also be a Machine Learning Model deployed as an API.
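And a toy version of the Enrichment API itself - here it is a simple lookup, but the handler body could just as well call a trained model's predict(); names and values are placeholders:

```python
from fastapi import FastAPI  # pip install fastapi uvicorn

app = FastAPI()

# Hypothetical enrichment source; in practice this could be a database
# lookup or a Machine Learning Model loaded into memory.
USER_SEGMENTS = {"u-123": "premium", "u-456": "free"}


@app.get("/enrich")
def enrich(user_id: str):
    # Return extra attributes to be merged into the event by the Enricher.
    return {"user_segment": USER_SEGMENTS.get(user_id, "unknown")}
```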
5. Real Time Loader - a Python or Spark Application that reads data from the Enriched Events and Enriched Events Dead Letter Topics and writes it in real time to ElasticSearch Indexes for analysis and alerting.
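A sketch of the Real Time Loader using the Elasticsearch Python client (8.x style) - index and topic names follow the same illustrative conventions as above:

```python
import json

from elasticsearch import Elasticsearch  # pip install elasticsearch
from kafka import KafkaConsumer          # pip install kafka-python

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "enriched-events", "enriched-events-dead-letter",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Route each Topic to its own index so dead-lettered events can be
    # analysed and alerted on separately.
    es.index(index=message.topic, document=message.value)
```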
6. Batch Loader - a Python or Spark Application that reads data from the Enriched Events Topic, batches it in memory and writes it to MinIO Object Storage.
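The Batch Loader could be as simple as this sketch using the MinIO Python client - batch size, credentials, bucket and object naming are all assumptions (in practice you would also flush on a timer and prefer a columnar format like Parquet):

```python
import io
import json
import uuid

from kafka import KafkaConsumer  # pip install kafka-python
from minio import Minio          # pip install minio

BATCH_SIZE = 1000
BUCKET = "enriched-events"  # hypothetical bucket name

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)
if not client.bucket_exists(BUCKET):
    client.make_bucket(BUCKET)

consumer = KafkaConsumer(
    "enriched-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE:
        # Flush the in-memory batch as one newline-delimited JSON object.
        payload = "\n".join(json.dumps(e) for e in batch).encode("utf-8")
        client.put_object(BUCKET, f"batch-{uuid.uuid4()}.json",
                          io.BytesIO(payload), length=len(payload))
        batch = []
```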
7. Scripts scheduled via Airflow that read data from the Enriched Events MinIO bucket, validate data quality, perform deduplication and apply any additional Enrichments. Here you also construct your Data Model to be used later for reporting purposes.
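A skeleton of the Airflow DAG for point 7 - the task bodies are stubs, the DAG name is made up, and the `schedule` argument assumes Airflow 2.4+ (older versions use `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG                               # pip install apache-airflow
from airflow.operators.python import PythonOperator


def validate_quality():
    """Read the latest batches from the enriched-events bucket and run
    data quality checks (row counts, null ratios, duplicates, ...)."""


def build_data_model():
    """Deduplicate events, apply extra enrichments and construct the
    curated Data Model tables used for reporting."""


with DAG(
    dag_id="curate_enriched_events",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    quality = PythonOperator(task_id="validate_quality",
                             python_callable=validate_quality)
    curate = PythonOperator(task_id="build_data_model",
                            python_callable=build_data_model)
    quality >> curate
```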
T1. A single Kafka instance that will hold all of the Topics for the Project.
T2. A single MinIO instance that will hold all of the Buckets for the Project.
T3. An Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO.
T4. A Presto/Trino cluster that you mount on top of the Curated Data in MinIO so that you can query it using Superset.
T5. An ElasticSearch instance to hold Real Time Data.
T6. A Superset Instance that you mount on top of the Trino Querying Engine for Batch Analytics and Elasticsearch for Real Time Analytics.
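To illustrate T4 and T6: once Trino is mounted on top of the curated MinIO data, anything that speaks SQL (Superset included) can query it. A small sketch with the trino Python client - catalog, schema and table names are assumptions:

```python
import trino  # pip install trino

# Hypothetical setup: a "hive" catalog pointing at the curated MinIO bucket.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="curated",
)

cur = conn.cursor()
cur.execute("""
    SELECT event_type, count(*) AS events
    FROM enriched_events
    GROUP BY event_type
    ORDER BY events DESC
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
```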
Join a growing community of 5200+ Data Enthusiasts by subscribing to my Newsletter: swirlai.substack.com/p/sai-15-whats…
Lambda and Kappa are both Data Architectures proposed to solve the movement of large amounts of data for reliable Online access.
The most popular architecture has been and continues to be Lambda. However, with Stream Processing becoming more accessible to organizations of every size, you will be hearing a lot more about Kappa in the near future. Let's see how they are different.
Let's remind ourselves of what a Request-Response Model Deployment looks like - The MLOps Way.
You will find this type of model deployment to be the most popular when it comes to Online Machine Learning Systems.
Let's zoom in:
1: Version Control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
2: Feature Preprocessing: Features are retrieved from the Feature Store, validated and passed to the next stage. Any feature-related metadata that is tightly coupled to the Model being trained is saved to the Experiment Tracking System.
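For example, with MLflow as the Experiment Tracking System, the feature-related metadata from this stage could be logged like this - feature names and statistics are illustrative:

```python
import mlflow  # pip install mlflow

# Hypothetical feature metadata produced by the preprocessing stage.
feature_metadata = {
    "feature_set": "user_features_v3",
    "features": ["user_segment", "lifetime_value"],
    "validation": {"null_ratio": 0.001, "rows": 1_250_000},
}

with mlflow.start_run(run_name="training-pipeline"):
    mlflow.log_params({"feature_set": feature_metadata["feature_set"]})
    # Persist the full metadata blob next to the run so the trained Model
    # stays tightly coupled to the exact Features it was built on.
    mlflow.log_dict(feature_metadata, "feature_metadata.json")
```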
1️⃣ "Fundamentals of Data Engineering" - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.
2️⃣ "Accelerate" - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices inside and out.
A Feature Store System sits between the Data Engineering and Machine Learning Pipelines and solves the following issues:
➡️ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5).
➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog and then serve Feature Sets for training and inference purposes through a unified interface (4, 3).
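As a concrete example, with Feast as the Feature Store the same Feature definitions back both the training and the inference path - the repo layout, entity and feature names below are assumptions:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore  # pip install feast

# Assumes a Feast repo with a "user_features" FeatureView and a "user_id"
# entity already defined - names here are purely illustrative.
store = FeatureStore(repo_path=".")

# Training: point-in-time correct Feature Sets from the offline (Batch) store.
entity_df = pd.DataFrame({
    "user_id": ["u-123", "u-456"],
    "event_timestamp": [datetime(2024, 1, 1), datetime(2024, 1, 2)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:user_segment", "user_features:lifetime_value"],
).to_df()

# Inference: the same Feature definitions served from the online store.
online_features = store.get_online_features(
    features=["user_features:user_segment", "user_features:lifetime_value"],
    entity_rows=[{"user_id": "u-123"}],
).to_dict()
```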
Change Data Capture is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
➡️ Database Replication (refer to 3️⃣ in the Diagram).
👉 CDC can be used for moving transactions performed against a Source Database to a Target DB. If each transaction is replicated, it is possible to retain all ACID guarantees when performing replication.
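As one possible (not the only) way to set this up, log-based CDC with Debezium runs as a Kafka Connect connector. A sketch of registering a Postgres connector through the Kafka Connect REST API - hostnames, credentials and the table list are placeholders:

```python
import requests  # pip install requests

# Assumption: Kafka Connect's REST interface runs on localhost:8083.
CONNECT_URL = "http://localhost:8083/connectors"

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "source-db",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",                 # Debezium 2.x topic naming
        "table.include.list": "public.orders",  # replicate only this table
    },
}

# Each committed transaction on public.orders is streamed as change events
# into Kafka, from where a sink connector can apply them to the Target DB.
response = requests.post(CONNECT_URL, json=connector, timeout=10)
response.raise_for_status()
```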
It should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.
If you are at the beginning of your ML journey, you might be:
1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside your Notebook and do so manually at each retraining.
If you are beyond Notebooks, you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator triggers.