Lambda and Kappa are both Data Architectures proposed to solve the problem of moving large amounts of data reliably for Online access.

The most popular architecture has been, and continues to be, Lambda. However, with Stream Processing becoming accessible to organizations of every size, you will be hearing a lot more about Kappa in the near future. Let's see how they differ.
𝗟𝗮𝗺𝗯𝗱𝗮.
➡️ The Ingestion Layer is responsible for collecting raw data and duplicating it for further Real Time and Batch processing separately.
➡️ Consists of 3 additional main layers:
👉 Speed or Stream - raw data arrives in Real Time and is processed by a Stream Processing Framework (e.g. Flink), then passed to the Serving Layer to create Real Time Views for low-latency, near Real Time data access.
👉 Batch - Batch ETL jobs using Batch Processing Frameworks (e.g. Spark) are run against the raw data to create reliable Batch Views for Offline Historical Data access.
👉 Serving - this is where the processed data is exposed to the end user. The latest Real Time data can be accessed from Real Time Views or combined with Batch Views for the full history; Historical Data can be accessed from Batch Views.
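To make the Serving Layer's job concrete, here is a minimal sketch (plain Python, illustrative names) of answering a full-history query by combining a Batch View with a Real Time View:

```python
from collections import Counter

# Batch View: per-user event counts produced by the nightly batch job.
batch_view = Counter({"alice": 1000, "bob": 250})
# Real Time View: counts for events that arrived after the batch cutoff.
realtime_view = Counter({"alice": 7, "carol": 3})

def full_history(user: str) -> int:
    # Full history = batch result + everything since the last batch run.
    return batch_view[user] + realtime_view[user]

print(full_history("alice"))  # 1007
print(full_history("carol"))  # 3 (only seen since the last batch run)
```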
❗️ Processing code is duplicated across different technologies in the Batch and Speed Layers, causing logic divergence (see the sketch after this list).
❗️ Compute resources are duplicated.
❗️ You need to manage two sets of Infrastructure.
✅ Distributed Batch Storage is reliable and scalable; even if the system crashes, state is easily recoverable without errors.
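To see why the duplication is dangerous and not just wasteful, here is a deliberately simplified sketch (plain Python standing in for Spark and Flink code) of the same business rule implemented twice:

```python
# The SAME rule ("count valid purchases per user") lives in two codebases.

def batch_view(events: list[dict]) -> dict[str, int]:
    # Batch-engine version (think: a Spark job over yesterday's files).
    counts: dict[str, int] = {}
    for e in events:
        if e["type"] == "purchase" and e["amount"] > 0:
            counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

def on_stream_event(e: dict, counts: dict[str, int]) -> None:
    # Stream-engine version (think: a Flink operator). If the validity rule
    # changes here but not above, Batch and Real Time Views silently diverge.
    if e["type"] == "purchase" and e["amount"] >= 0:  # <- subtle drift: >= vs >
        counts[e["user"]] = counts.get(e["user"], 0) + 1
```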
𝗞𝗮𝗽𝗽𝗮.
➡️ Treats both Batch and Real Time workloads as a single Stream Processing problem.
➡️ Uses the Speed Layer only to prepare data for both Real Time and Batch access.
➡️ Consists of only 2 main layers:
👉 Speed or Stream - similar to Lambda, but often (optionally) backed by Tiered Storage, meaning that all data coming into the system is retained indefinitely across different storage tiers, e.g. S3 or GCS for historical data and an on-disk log for hot data.
👉 Serving - same as in Lambda, but the transformations performed in the Speed Layer are never duplicated in a separate Batch Layer.
❗️ Some transformations are hard to perform in the Speed Layer (e.g. complex joins) and eventually get pushed down to Batch storage for implementation.
❗️ Requires strong Stream Processing skills.
✅ Data is processed only once, by a single Stream Processing Engine.
✅ You only need to manage a single set of Infrastructure.
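A minimal sketch of the Kappa idea, assuming kafka-python and a topic named "events" (both illustrative): the same processing code serves real-time traffic and, by replaying the log from the beginning under a new consumer group, also covers full historical reprocessing.

```python
import json
from kafka import KafkaConsumer  # assumes kafka-python is installed

serving_view: dict = {}

def process(event: dict) -> None:
    # The single place where business logic lives; used for BOTH
    # real-time serving and historical backfills.
    serving_view[event["user_id"]] = event

def run(group_id: str) -> None:
    # A fresh group_id with auto_offset_reset="earliest" replays the whole
    # log (including tiered storage), which replaces a separate Batch Layer.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:
        process(msg.value)

# run("serving-v1")   # continuous real-time processing
# run("backfill-v2")  # reprocess history: same code, replayed from offset 0
```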
Join a growing community of 5100+ Data Enthusiasts by subscribing to my 𝗡𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿: swirlai.substack.com/p/sai-15-whats…
𝗡𝗼 𝗘𝘅𝗰𝘂𝘀𝗲𝘀 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗣𝗼𝗿𝘁𝗳𝗼𝗹𝗶𝗼 𝗧𝗲𝗺𝗽𝗹𝗮𝘁𝗲 - next week I will enrich it with the missing Machine Learning and MLOps parts!
Let's remind ourselves of what a 𝗥𝗲𝗾𝘂𝗲𝘀𝘁-𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲 𝗠𝗼𝗱𝗲𝗹 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 looks like - 𝗧𝗵𝗲 𝗠𝗟𝗢𝗽𝘀 𝗪𝗮𝘆.
You will find this type of model deployment to be the most popular when it comes to Online Machine Learning Systems.
Let's zoom in:
𝟭: Version Control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.

𝟮: Feature Preprocessing: features are retrieved from the Feature Store, validated, and passed to the next stage. Any feature-related metadata that is tightly coupled to the model being trained is saved to the Experiment Tracking System.
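A rough sketch of step 2, assuming MLflow for experiment tracking; fetch_features and the feature names are hypothetical stand-ins for a real Feature Store read:

```python
import mlflow
import pandas as pd

FEATURES = ["user_age", "avg_session_length"]  # illustrative feature names

def fetch_features(names: list[str]) -> pd.DataFrame:
    # Stand-in for a real Feature Store query; returns dummy data here.
    return pd.DataFrame({name: [0.0, 1.0] for name in names})

with mlflow.start_run(run_name="training-pipeline"):
    df = fetch_features(FEATURES)
    # Validation gate before the next stage.
    assert not df.isnull().values.any(), "features contain missing values"
    # Feature metadata tightly coupled to this model version is logged to
    # the Experiment Tracking System so the run stays reproducible.
    mlflow.log_param("feature_set", ",".join(FEATURES))
    mlflow.log_param("row_count", len(df))
```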
1️⃣ “𝗙𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹𝘀 𝗼𝗳 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴” - a book I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow, and it will prepare you for further deep dives.

2️⃣ “𝗔𝗰𝗰𝗲𝗹𝗲𝗿𝗮𝘁𝗲” - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices inside and out.
A Feature Store System sits between the Data Engineering and Machine Learning Pipelines, and it solves the following issues:
➡️ Eliminates Training/Serving skew by syncing the Batch and Online Serving Storage (5).
➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog, and then serve Feature Sets for both training and inference through a unified interface (4, 3).
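As a rough illustration of that unified interface, here is a minimal sketch using Feast (the feature view and entity names are hypothetical, and a configured feature repo is assumed): the same feature definitions back both the offline read used for training and the online read used for inference, which is what eliminates the skew.

```python
import pandas as pd
from feast import FeatureStore  # assumes a configured Feast repo

store = FeatureStore(repo_path=".")
FEATURES = ["user_stats:avg_session_length"]  # hypothetical feature view

# Offline read for training: point-in-time correct historical features.
entity_df = pd.DataFrame({
    "user_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df, features=FEATURES
).to_df()

# Online read for inference: same definitions, served from the low-latency store.
online_features = store.get_online_features(
    features=FEATURES, entity_rows=[{"user_id": 1}]
).to_dict()
```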
𝗖𝗵𝗮𝗻𝗴𝗲 𝗗𝗮𝘁𝗮 𝗖𝗮𝗽𝘁𝘂𝗿𝗲 is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
➡️ 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 (refer to 3️⃣ in the Diagram).
👉 CDC can be used to move transactions performed against a Source Database to a Target Database. If each transaction is replicated in order, it is possible to retain all ACID guarantees when performing replication.
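For intuition, here is a minimal sketch of applying one Debezium-style change event to a target table (the op/before/after shape follows Debezium's convention; an in-memory dict stands in for the Target Database):

```python
target_table: dict[int, dict] = {}  # stands in for the Target Database

def apply_change(event: dict) -> None:
    op = event["op"]  # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    if op in ("c", "u", "r"):
        row = event["after"]
        target_table[row["id"]] = row  # upsert keeps the target in sync
    elif op == "d":
        target_table.pop(event["before"]["id"], None)

# Events as a CDC connector would emit them for one source transaction:
apply_change({"op": "c", "after": {"id": 1, "balance": 100}})
apply_change({"op": "u", "before": {"id": 1, "balance": 100},
              "after": {"id": 1, "balance": 70}})
print(target_table)  # {1: {'id': 1, 'balance': 70}}
```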
An ML Model Tracking setup should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.
If you are at the beginning of your ML journey, you might be:
1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside your Notebook and do so manually at each retraining.

If you are beyond Notebooks, you will be running ML Pipelines from CI/CD pipelines and on Orchestrator triggers.
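A minimal sketch of the "two integrated parts" in practice, assuming MLflow with a registry-enabled tracking server (model and parameter names are illustrative):

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=42)

with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    # Experiment Tracking System: parameters and metrics for this run.
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Model Registry: version the trained artifact under a registered name.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```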