Lambda and Kappa are both Data Architectures proposed to solve the problem of moving large amounts of data reliably for Online access.

The most popular architecture has been, and continues to be, Lambda. However, with Stream Processing becoming accessible to organizations of every size, you will be hearing a lot more about Kappa in the near future. Let's see how they differ.
𝗟𝗮𝗺𝗯𝗱𝗮.
➡️ The Ingestion Layer is responsible for collecting raw data and duplicating it for further Real Time and Batch processing separately.
➡️ Consists of 3 additional main layers:
👉 Speed or Stream - raw data arrives in Real Time and is processed by a Stream Processing Framework (e.g. Flink), then passed to the Serving Layer to create Real Time Views for low-latency, near Real Time data access.
👉 Batch - Batch ETL jobs using Batch Processing Frameworks (e.g. Spark) are run against the raw data to create reliable Batch Views for Offline Historical Data access.
👉 Serving - this is where the processed data is exposed to the end user. The latest Real Time data can be accessed from Real Time Views or combined with Batch Views for the full history; Historical Data can be accessed from Batch Views.
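To make the Serving Layer's job concrete, here is a minimal sketch (plain Python, illustrative names) of answering a full-history query by combining a Batch View with a Real Time View:

```python
from collections import Counter

# Batch View: per-user event counts produced by the nightly batch job.
batch_view = Counter({"alice": 1000, "bob": 250})
# Real Time View: counts for events that arrived after the batch cutoff.
realtime_view = Counter({"alice": 7, "carol": 3})

def full_history(user: str) -> int:
    # Full history = batch result + everything since the last batch run.
    return batch_view[user] + realtime_view[user]

print(full_history("alice"))  # 1007
print(full_history("carol"))  # 3 (only seen since the last batch run)
```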
❗️ Processing code is duplicated across different technologies in the Batch and Speed Layers, causing logic divergence (see the sketch after this list).
❗️ Compute resources are duplicated.
❗️ You need to manage two sets of Infrastructure.
✅ Distributed Batch Storage is reliable and scalable; even if the system crashes, state is easily recoverable without errors.
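To see why the duplication is dangerous and not just wasteful, here is a deliberately simplified sketch (plain Python standing in for Spark and Flink code) of the same business rule implemented twice:

```python
# The SAME rule ("count valid purchases per user") lives in two codebases.

def batch_view(events: list[dict]) -> dict[str, int]:
    # Batch-engine version (think: a Spark job over yesterday's files).
    counts: dict[str, int] = {}
    for e in events:
        if e["type"] == "purchase" and e["amount"] > 0:
            counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

def on_stream_event(e: dict, counts: dict[str, int]) -> None:
    # Stream-engine version (think: a Flink operator). If the validity rule
    # changes here but not above, Batch and Real Time Views silently diverge.
    if e["type"] == "purchase" and e["amount"] >= 0:  # <- subtle drift: >= vs >
        counts[e["user"]] = counts.get(e["user"], 0) + 1
```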
𝗞𝗮𝗽𝗽𝗮.
➡️ Treats both Batch and Real Time workloads as a single Stream Processing problem.
➡️ Uses the Speed Layer only to prepare data for both Real Time and Batch access.
➡️ Consists of only 2 main layers:
👉 Speed or Stream - similar to Lambda, but often (optionally) backed by Tiered Storage, meaning that all data coming into the system is retained indefinitely across different storage tiers, e.g. S3 or GCS for historical data and an on-disk log for hot data.
👉 Serving - same as in Lambda, but the transformations performed in the Speed Layer are never duplicated in a separate Batch Layer.
❗️ Some transformations are hard to perform in the Speed Layer (e.g. complex joins) and eventually get pushed down to Batch storage for implementation.
❗️ Requires strong Stream Processing skills.
✅ Data is processed only once, by a single Stream Processing Engine.
✅ You only need to manage a single set of Infrastructure.
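A minimal sketch of the Kappa idea, assuming kafka-python and a topic named "events" (both illustrative): the same processing code serves real-time traffic and, by replaying the log from the beginning under a new consumer group, also covers full historical reprocessing.

```python
import json
from kafka import KafkaConsumer  # assumes kafka-python is installed

serving_view: dict = {}

def process(event: dict) -> None:
    # The single place where business logic lives; used for BOTH
    # real-time serving and historical backfills.
    serving_view[event["user_id"]] = event

def run(group_id: str) -> None:
    # A fresh group_id with auto_offset_reset="earliest" replays the whole
    # log (including tiered storage), which replaces a separate Batch Layer.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:
        process(msg.value)

# run("serving-v1")   # continuous real-time processing
# run("backfill-v2")  # reprocess history: same code, replayed from offset 0
```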
Join a growing community of 5100+ Data Enthusiasts by subscribing to my 𝗡𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿: swirlai.substack.com/p/sai-15-whats…
𝗡𝗼 𝗘𝘅𝗰𝘂𝘀𝗲𝘀 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗣𝗼𝗿𝘁𝗳𝗼𝗹𝗶𝗼 𝗧𝗲𝗺𝗽𝗹𝗮𝘁𝗲 - next week I will enrich it with the missing Machine Learning and MLOps parts!
Let's remind ourselves of what a 𝗥𝗲𝗾𝘂𝗲𝘀𝘁-𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲 𝗠𝗼𝗱𝗲𝗹 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 looks like - 𝗧𝗵𝗲 𝗠𝗟𝗢𝗽𝘀 𝗪𝗮𝘆.
You will find this type of model deployment to be the most popular when it comes to Online Machine Learning Systems.
Let's zoom in:
𝟭: Version Control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.

𝟮: Feature Preprocessing: features are retrieved from the Feature Store, validated, and passed to the next stage. Any feature-related metadata that is tightly coupled to the model being trained is saved to the Experiment Tracking System.
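A rough sketch of step 2, assuming MLflow for experiment tracking; fetch_features and the feature names are hypothetical stand-ins for a real Feature Store read:

```python
import mlflow
import pandas as pd

FEATURES = ["user_age", "avg_session_length"]  # illustrative feature names

def fetch_features(names: list[str]) -> pd.DataFrame:
    # Stand-in for a real Feature Store query; returns dummy data here.
    return pd.DataFrame({name: [0.0, 1.0] for name in names})

with mlflow.start_run(run_name="training-pipeline"):
    df = fetch_features(FEATURES)
    # Validation gate before the next stage.
    assert not df.isnull().values.any(), "features contain missing values"
    # Feature metadata tightly coupled to this model version is logged to
    # the Experiment Tracking System so the run stays reproducible.
    mlflow.log_param("feature_set", ",".join(FEATURES))
    mlflow.log_param("row_count", len(df))
```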
1️⃣ “𝗙𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹𝘀 𝗼𝗳 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴” - a book I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow, and it will prepare you for further deep dives.

2️⃣ “𝗔𝗰𝗰𝗲𝗹𝗲𝗿𝗮𝘁𝗲” - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices inside and out.
A Feature Store System sits between the Data Engineering and Machine Learning Pipelines, and it solves the following issues:
➡️ Eliminates Training/Serving skew by syncing the Batch and Online Serving Storage (5).
➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog, and then serve Feature Sets for both training and inference through a unified interface (4, 3).
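As a rough illustration of that unified interface, here is a minimal sketch using Feast (the feature view and entity names are hypothetical, and a configured feature repo is assumed): the same feature definitions back both the offline read used for training and the online read used for inference, which is what eliminates the skew.

```python
import pandas as pd
from feast import FeatureStore  # assumes a configured Feast repo

store = FeatureStore(repo_path=".")
FEATURES = ["user_stats:avg_session_length"]  # hypothetical feature view

# Offline read for training: point-in-time correct historical features.
entity_df = pd.DataFrame({
    "user_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df, features=FEATURES
).to_df()

# Online read for inference: same definitions, served from the low-latency store.
online_features = store.get_online_features(
    features=FEATURES, entity_rows=[{"user_id": 1}]
).to_dict()
```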
𝗖𝗵𝗮𝗻𝗴𝗲 𝗗𝗮𝘁𝗮 𝗖𝗮𝗽𝘁𝘂𝗿𝗲 is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
➡️ 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 (refer to 3️⃣ in the Diagram).
👉 CDC can be used to move transactions performed against a Source Database to a Target Database. If each transaction is replicated in order, it is possible to retain all ACID guarantees when performing replication.
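For intuition, here is a minimal sketch of applying one Debezium-style change event to a target table (the op/before/after shape follows Debezium's convention; an in-memory dict stands in for the Target Database):

```python
target_table: dict[int, dict] = {}  # stands in for the Target Database

def apply_change(event: dict) -> None:
    op = event["op"]  # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    if op in ("c", "u", "r"):
        row = event["after"]
        target_table[row["id"]] = row  # upsert keeps the target in sync
    elif op == "d":
        target_table.pop(event["before"]["id"], None)

# Events as a CDC connector would emit them for one source transaction:
apply_change({"op": "c", "after": {"id": 1, "balance": 100}})
apply_change({"op": "u", "before": {"id": 1, "balance": 100},
              "after": {"id": 1, "balance": 70}})
print(target_table)  # {1: {'id': 1, 'balance': 70}}
```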
An ML Model Tracking setup should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.
If you are at the beginning of your ML journey, you might be:
1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside your Notebook and do so manually at each retraining.

If you are beyond Notebooks, you will be running ML Pipelines from CI/CD pipelines and on Orchestrator triggers.
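A minimal sketch of the "two integrated parts" in practice, assuming MLflow with a registry-enabled tracking server (model and parameter names are illustrative):

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=42)

with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    # Experiment Tracking System: parameters and metrics for this run.
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Model Registry: version the trained artifact under a registered name.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```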