Change Data Capture (CDC) is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
➡️ Database Replication (refer to 3️⃣ in the Diagram).
👉 CDC can be used to move transactions performed against a Source Database to a Target Database. If each transaction is replicated, it is possible to retain all ACID guarantees when performing replication.
👉 Real-Time CDC is extremely valuable here as it enables Zero Downtime Database Replication and Migration. E.g., it is extensively used when migrating on-prem Databases that serve Critical Applications, which cannot be shut down for a moment, to the cloud.
➡️ Facilitation of Data Movement from Operational Databases to Data Lakes (refer to 1️⃣ in the Diagram) or Data Warehouses (refer to 2️⃣ in the Diagram) for Analytics purposes.
👉 There are currently two Data Movement patterns widely applied in the industry: ETL and ELT.
👉 In the case of ETL, data extracted by CDC can be transformed on the fly and eventually pushed to the Data Lake or Data Warehouse.
👉 In the case of ELT, data is replicated to the Data Lake or Data Warehouse as-is, and transformations are performed inside the target system. A minimal sketch of both patterns follows below.
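To make the contrast concrete, here is a minimal, illustrative sketch of the two patterns. Every function and table name below (transform, load, run_in_warehouse, analytics.users) is a hypothetical stand-in, not a real library API:

```python
# Minimal ETL vs. ELT sketch over a batch of CDC events.
# All names below are hypothetical stand-ins for a real writer/engine.

def transform(event: dict) -> dict:
    # Example in-flight transformation: normalize the email field.
    return {**event, "email": event["email"].lower()}

def load(rows: list, target: str) -> None:
    # Stand-in for a Data Lake / Data Warehouse writer.
    print(f"loading {len(rows)} rows into {target}")

def run_in_warehouse(sql: str) -> None:
    # Stand-in for executing SQL inside the target system.
    print(f"warehouse executes: {sql}")

events = [{"id": 1, "email": "User@Example.COM"}]

# ETL: transform on the fly, then push the clean result.
load([transform(e) for e in events], target="analytics.users")

# ELT: replicate as-is, then transform inside the warehouse.
load(events, target="raw.users")
run_in_warehouse("INSERT INTO analytics.users SELECT id, lower(email) FROM raw.users")
```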
There is more than one way to implement CDC; the methods are mainly split into three groups, two of which - query-based and log-based CDC - are described below:
👉 Query-based CDC: a client periodically queries the Source Database and pushes the changed data into the Target Database (see the sketch after the list of downsides below).
❌ Downside 1: There is a need to augment all of the source tables to include indicators (e.g. an update timestamp column) that a record has changed.
❌ Downside 2: Usually not Real-Time CDC; it might be performed hourly, daily, etc.
❌ Downside 3: The Source Database suffers high load while CDC is being performed.
❌ Downside 4: It is extremely challenging to replicate Delete events - a deleted row simply no longer shows up in query results.
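A self-contained sketch of the query-based approach, using SQLite and a hypothetical users table with an updated_at indicator column (both names are illustrative assumptions):

```python
import sqlite3
from datetime import datetime, timezone

# Query-based CDC sketch: poll the source for rows changed since a watermark.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")
tgt.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")

now = datetime.now(timezone.utc).isoformat()
src.execute("INSERT INTO users VALUES (1, 'a@example.com', ?)", (now,))
src.commit()

def poll_changes(last_seen: str) -> str:
    """Copy rows changed since last_seen into the target; return the new watermark."""
    rows = src.execute(
        "SELECT id, email, updated_at FROM users WHERE updated_at > ?", (last_seen,)
    ).fetchall()
    for row in rows:
        # Upsert into the target. Note: DELETEs on the source are invisible
        # to this query (Downside 4 above).
        tgt.execute("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", row)
    tgt.commit()
    return max((r[2] for r in rows), default=last_seen)

watermark = poll_changes("")  # first run replicates everything seen so far
print(tgt.execute("SELECT * FROM users").fetchall())
```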
👉 Log-based CDC: Transactional Databases log all of the events performed against the Database in a transaction log for recovery purposes.
👉 A Transaction Log Miner is mounted on top of these logs and pushes selected events into a Downstream System. A popular implementation is Debezium (see the consumer sketch after the list below).
❌ Downside 1: More complicated to set up.
❌ Downside 2: Not all Databases have open-source connectors.
✅ Upside 1: Least load on the Source Database - events are read from the log instead of querying the tables.
✅ Upside 2: Real-Time CDC.
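Debezium streams change events into Kafka topics (typically one per table) in a documented envelope with op, before, and after fields. Below is a hedged consumer sketch; the topic name, broker address, and the apply_* helpers are hypothetical, and the kafka-python client is just one way to read the stream:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical downstream writers - stand-ins for your target system.
def apply_insert(row): print("INSERT", row)
def apply_update(before, after): print("UPDATE", before, "->", after)
def apply_delete(row): print("DELETE", row)

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",   # hypothetical Debezium topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    if message.value is None:          # tombstone record - skip
        continue
    payload = message.value.get("payload", {})
    op = payload.get("op")             # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "r"):
        apply_insert(payload["after"])
    elif op == "u":
        apply_update(payload["before"], payload["after"])
    elif op == "d":
        apply_delete(payload["before"])  # deletes ARE captured, unlike query-based CDC
```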
Join a growing community of 3000+ Data Enthusiasts by subscribing to my Newsletter: swirlai.substack.com/p/sai-10-airfl…
1️⃣ "Fundamentals of Data Engineering" - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.
2️⃣ "Accelerate" - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices in and out.
A Feature Store System sits between Data Engineering and Machine Learning Pipelines and solves the following issues:
➡️ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5️⃣).
➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog, and then serve Feature Sets for training and inference purposes through a unified interface (4️⃣, 3️⃣). A minimal sketch of such an interface follows below.
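To illustrate the "define once, serve for both training and inference" idea, here is a minimal, hypothetical feature-store interface. Every class, method, and feature name is an illustrative assumption - real systems such as Feast expose richer APIs:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    catalog: dict = field(default_factory=dict)   # metadata layer: name -> transformation
    online: dict = field(default_factory=dict)    # low-latency store: entity id -> features

    def register(self, name, transformation):
        # The transformation is defined once and discoverable by name.
        self.catalog[name] = transformation

    def materialize(self, name, raw_rows):
        # Batch path: compute features and sync them to the online store,
        # so training and serving read the SAME values (no skew).
        for row in raw_rows:
            self.online.setdefault(row["id"], {})[name] = self.catalog[name](row)

    def get_training_set(self, names):
        # Offline/batch retrieval for model training.
        return [{"id": k, **{n: v[n] for n in names}} for k, v in self.online.items()]

    def get_online_features(self, entity_id, names):
        # Low-latency retrieval at inference time.
        return {n: self.online[entity_id][n] for n in names}

store = FeatureStore()
store.register("spend_usd", lambda row: row["cents"] / 100)
store.materialize("spend_usd", [{"id": 1, "cents": 250}])
print(store.get_training_set(["spend_usd"]))
print(store.get_online_features(1, ["spend_usd"]))
```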
It should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.
If you are at the beginning of your ML journey you might be:
1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside of your Notebook and do that manually at each retraining.
If you are beyond Notebooks, you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator triggers. Either way, a sketch of tracking experiments and registering models follows below.
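The thread does not name a specific tool; as one popular open-source implementation of Experiment Tracking plus a Model Registry, here is a minimal MLflow sketch (the experiment and model names are illustrative):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("churn-model")          # hypothetical experiment name

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

with mlflow.start_run() as run:
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("C", model.C)                      # experiment tracking: parameters...
    mlflow.log_metric("train_acc", model.score(X, y))   # ...and metrics
    mlflow.sklearn.log_model(model, "model")            # artifact stored with the run

# Model Registry: promote the run's artifact to a named, versioned model.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-model")
```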