Change Data Capture (CDC) is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
➡️ Database Replication (refer to 3️⃣ in the Diagram).
👉 CDC can be used to move transactions performed against a Source Database to a Target Database. If each transaction is replicated, it is possible to retain all ACID guarantees when performing replication.
👉 Real-Time CDC is extremely valuable here as it enables Zero Downtime Database Replication and Migration. E.g., it is extensively used when migrating on-prem Databases that serve Critical Applications, which cannot be shut down for a moment, to the cloud.
➡️ Facilitation of Data Movement from Operational Databases to Data Lakes (refer to 1️⃣ in the Diagram) or Data Warehouses (refer to 2️⃣ in the Diagram) for Analytics purposes.
👉 There are currently two Data Movement patterns widely applied in the industry: ETL and ELT.
👉 In the case of ETL, data extracted by CDC can be transformed on the fly and eventually pushed to the Data Lake or Data Warehouse.
👉 In the case of ELT, data is replicated to the Data Lake or Data Warehouse as-is, and transformations are performed inside the target system. A minimal sketch of both patterns follows below.
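To make the contrast concrete, here is a minimal, illustrative sketch of the two patterns. Every function and table name below (transform, load, run_in_warehouse, analytics.users) is a hypothetical stand-in, not a real library API:

```python
# Minimal ETL vs. ELT sketch over a batch of CDC events.
# All names below are hypothetical stand-ins for a real writer/engine.

def transform(event: dict) -> dict:
    # Example in-flight transformation: normalize the email field.
    return {**event, "email": event["email"].lower()}

def load(rows: list, target: str) -> None:
    # Stand-in for a Data Lake / Data Warehouse writer.
    print(f"loading {len(rows)} rows into {target}")

def run_in_warehouse(sql: str) -> None:
    # Stand-in for executing SQL inside the target system.
    print(f"warehouse executes: {sql}")

events = [{"id": 1, "email": "User@Example.COM"}]

# ETL: transform on the fly, then push the clean result.
load([transform(e) for e in events], target="analytics.users")

# ELT: replicate as-is, then transform inside the warehouse.
load(events, target="raw.users")
run_in_warehouse("INSERT INTO analytics.users SELECT id, lower(email) FROM raw.users")
```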
There is more than one way to implement CDC; the methods are mainly split into three groups, two of which - query-based and log-based CDC - are described below:
👉 Query-based CDC: a client periodically queries the Source Database and pushes the changed data into the Target Database (see the sketch after the list of downsides below).
❌ Downside 1: There is a need to augment all of the source tables to include indicators (e.g. an update timestamp column) that a record has changed.
❌ Downside 2: Usually not Real-Time CDC; it might be performed hourly, daily, etc.
❌ Downside 3: The Source Database suffers high load while CDC is being performed.
❌ Downside 4: It is extremely challenging to replicate Delete events - a deleted row simply no longer shows up in query results.
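A self-contained sketch of the query-based approach, using SQLite and a hypothetical users table with an updated_at indicator column (both names are illustrative assumptions):

```python
import sqlite3
from datetime import datetime, timezone

# Query-based CDC sketch: poll the source for rows changed since a watermark.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")
tgt.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")

now = datetime.now(timezone.utc).isoformat()
src.execute("INSERT INTO users VALUES (1, 'a@example.com', ?)", (now,))
src.commit()

def poll_changes(last_seen: str) -> str:
    """Copy rows changed since last_seen into the target; return the new watermark."""
    rows = src.execute(
        "SELECT id, email, updated_at FROM users WHERE updated_at > ?", (last_seen,)
    ).fetchall()
    for row in rows:
        # Upsert into the target. Note: DELETEs on the source are invisible
        # to this query (Downside 4 above).
        tgt.execute("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", row)
    tgt.commit()
    return max((r[2] for r in rows), default=last_seen)

watermark = poll_changes("")  # first run replicates everything seen so far
print(tgt.execute("SELECT * FROM users").fetchall())
```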
👉 Log-based CDC: Transactional Databases log all of the events performed against the Database in a transaction log for recovery purposes.
👉 A Transaction Log Miner is mounted on top of these logs and pushes selected events into a Downstream System. A popular implementation is Debezium (see the consumer sketch after the list below).
❌ Downside 1: More complicated to set up.
❌ Downside 2: Not all Databases have open-source connectors.
✅ Upside 1: Least load on the Source Database - events are read from the log instead of querying the tables.
✅ Upside 2: Real-Time CDC.
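Debezium streams change events into Kafka topics (typically one per table) in a documented envelope with op, before, and after fields. Below is a hedged consumer sketch; the topic name, broker address, and the apply_* helpers are hypothetical, and the kafka-python client is just one way to read the stream:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical downstream writers - stand-ins for your target system.
def apply_insert(row): print("INSERT", row)
def apply_update(before, after): print("UPDATE", before, "->", after)
def apply_delete(row): print("DELETE", row)

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",   # hypothetical Debezium topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    if message.value is None:          # tombstone record - skip
        continue
    payload = message.value.get("payload", {})
    op = payload.get("op")             # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "r"):
        apply_insert(payload["after"])
    elif op == "u":
        apply_update(payload["before"], payload["after"])
    elif op == "d":
        apply_delete(payload["before"])  # deletes ARE captured, unlike query-based CDC
```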
Join a growing community of 3000+ Data Enthusiasts by subscribing to my Newsletter: swirlai.substack.com/p/sai-10-airfl…
1️⃣ "Fundamentals of Data Engineering" - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.
2️⃣ "Accelerate" - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices in and out.
A Feature Store System sits between Data Engineering and Machine Learning Pipelines and solves the following issues:
➡️ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5️⃣).
➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog, and then serve Feature Sets for training and inference purposes through a unified interface (4️⃣, 3️⃣). A minimal sketch of such an interface follows below.
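To illustrate the "define once, serve for both training and inference" idea, here is a minimal, hypothetical feature-store interface. Every class, method, and feature name is an illustrative assumption - real systems such as Feast expose richer APIs:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    catalog: dict = field(default_factory=dict)   # metadata layer: name -> transformation
    online: dict = field(default_factory=dict)    # low-latency store: entity id -> features

    def register(self, name, transformation):
        # The transformation is defined once and discoverable by name.
        self.catalog[name] = transformation

    def materialize(self, name, raw_rows):
        # Batch path: compute features and sync them to the online store,
        # so training and serving read the SAME values (no skew).
        for row in raw_rows:
            self.online.setdefault(row["id"], {})[name] = self.catalog[name](row)

    def get_training_set(self, names):
        # Offline/batch retrieval for model training.
        return [{"id": k, **{n: v[n] for n in names}} for k, v in self.online.items()]

    def get_online_features(self, entity_id, names):
        # Low-latency retrieval at inference time.
        return {n: self.online[entity_id][n] for n in names}

store = FeatureStore()
store.register("spend_usd", lambda row: row["cents"] / 100)
store.materialize("spend_usd", [{"id": 1, "cents": 250}])
print(store.get_training_set(["spend_usd"]))
print(store.get_online_features(1, ["spend_usd"]))
```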
It should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.
If you are at the beginning of your ML journey you might be:
1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside of your Notebook and do that manually at each retraining.
If you are beyond Notebooks, you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator triggers. Either way, a sketch of tracking experiments and registering models follows below.
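The thread does not name a specific tool; as one popular open-source implementation of Experiment Tracking plus a Model Registry, here is a minimal MLflow sketch (the experiment and model names are illustrative):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("churn-model")          # hypothetical experiment name

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

with mlflow.start_run() as run:
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("C", model.C)                      # experiment tracking: parameters...
    mlflow.log_metric("train_acc", model.score(X, y))   # ...and metrics
    mlflow.sklearn.log_model(model, "model")            # artifact stored with the run

# Model Registry: promote the run's artifact to a named, versioned model.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-model")
```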