We have covered lots of concepts around Kafka already. But what are the most common Kafka use cases that you are very likely to run into as a Data Engineer?
➡️ The original use case for Kafka at LinkedIn - tracking website activity.
➡️ Events happening on the website, like page views and conversions, are sent via a Gateway and piped to Kafka Topics.
➡️ These events are forwarded to downstream Analytical systems or processed in Real Time.
➡️ Kafka is used as an initial buffer since the Data volumes are usually big, and Kafka guarantees no message loss due to its replication mechanisms.
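To make this concrete, here is a minimal sketch of what the Gateway side could look like in Python, assuming the confluent-kafka client and a hypothetical "page-views" Topic - not the exact LinkedIn setup:

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def track_event(user_id: str, event_type: str, page: str) -> None:
    """Serialize a website event and pipe it to the Kafka Topic."""
    event = {"user_id": user_id, "event_type": event_type, "page": page}
    # Keying by user_id keeps each user's events ordered within one partition.
    producer.produce("page-views", key=user_id, value=json.dumps(event))

track_event("u-42", "page_view", "/pricing")
producer.flush()  # block until buffered events are delivered to the brokers
```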
➡️ The Database commit log is piped to a Kafka topic.
➡️ The committed changes are replayed against a new Database in the same order.
➡️ The result is a full Database replica.
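A sketch of the replica-building side, assuming plain JSON change events on a hypothetical "db-commit-log" Topic; real deployments typically use a CDC tool such as Debezium with a richer event schema:

```python
import json
import sqlite3

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replica-builder",
    "auto.offset.reset": "earliest",  # replay the commit log from the start
})
consumer.subscribe(["db-commit-log"])

replica = sqlite3.connect("replica.db")
replica.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    change = json.loads(msg.value())
    # Applying changes in commit-log order reproduces the source Database state.
    replica.execute(change["sql"], change.get("params", []))
    replica.commit()
```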
➡️ Kafka is used for centralized Log and Metrics collection.
➡️ Daemons like FluentD are deployed on servers or in containers together with the Applications to be monitored.
➡️ Applications send their Logs/Metrics to the Daemons.
➡️ The Daemons pipe Logs/Metrics to a Kafka Topic.
➡️ Logs/Metrics are delivered downstream to storage systems like ElasticSearch (for Log discovery) or InfluxDB (for Metrics).
➡️ This is also how you would track your IoT Fleets.
➡️ This is usually coupled with the ingestion mechanisms already covered.
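As an illustration, a sketch of the last hop - a Consumer shipping log records into ElasticSearch. It assumes the elasticsearch-py 8.x client and a hypothetical "app-logs" Topic:

```python
import json

from confluent_kafka import Consumer
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "log-indexer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["app-logs"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Index each record so it becomes searchable for Log discovery.
    es.index(index="app-logs", document=record)
```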
➡️ Instead of piping Data to a certain storage downstream, we mount a Stream Processing Framework on top of Kafka Topics.
➡️ The Data is filtered, enriched and then piped to the downstream systems to be used according to the use case.
➡️ This is also where one would run Machine Learning Models embedded into a Stream Processing Application.
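A toy filter-and-enrich loop to show the shape of it, assuming hypothetical "page-views" (input) and "enriched-page-views" (output) Topics; dedicated frameworks like Kafka Streams or Flink add state, windowing and fault tolerance on top of this pattern:

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
})
consumer.subscribe(["page-views"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("event_type") != "page_view":
        continue  # filter: drop everything that is not a page view
    event["country"] = "US"  # enrich: stand-in for e.g. a GeoIP lookup
    producer.produce("enriched-page-views", value=json.dumps(event))
```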
𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴.
➡️ Kafka can be used as a replacement for more traditional messaging brokers like RabbitMQ.
➡️ Kafka has better durability guarantees and makes it easy for several separate Consumer Groups to consume from the same Topic.
⚠️ Having said this - always consider the complexity you are bringing in with the introduction of a Distributed System. Sometimes it is better to just use traditional frameworks.
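The fan-out property is just a matter of the group.id setting - a sketch, assuming a hypothetical "orders" Topic:

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,  # offsets are tracked independently per group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])  # hypothetical Topic name
    return consumer

# Two separate Consumer Groups - each receives every message on the Topic.
billing = make_consumer("billing-service")
analytics = make_consumer("analytics-service")
```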
I believe that the following is a correct order to start in 𝗬𝗼𝘂𝗿 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗣𝗮𝘁𝗵:
➡️ Data Extraction
➡️ Data Validation
➡️ Data Contracts
➡️ Loading Data into a DWH / Data Lake
➡️ Transformations in a DWH / Data Lake
➡️ Scheduling
Kafka is an extremely important 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺 to understand, as it was the first of its kind and most newer products are built on its ideas.
➡️ Clients writing to Kafka are called 𝗣𝗿𝗼𝗱𝘂𝗰𝗲𝗿𝘀,
➡️ Clients reading the Data are called 𝗖𝗼𝗻𝘀𝘂𝗺𝗲𝗿𝘀.
➡️ Data is written into 𝗧𝗼𝗽𝗶𝗰𝘀 that can be compared to 𝗧𝗮𝗯𝗹𝗲𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲𝘀.
➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently since the entire row can be written at once.
➡️ For select statements that target a subset of columns, reading is slower since you need to scan entire rows to extract just the columns you need.
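A toy illustration of the trade-off with fixed-size rows packed one after another (the layout is deliberately simplified; real engines add pages, headers and indexes):

```python
import struct

# One row: id (int), name (20 bytes), score (float) - stored contiguously.
ROW = struct.Struct("i20sf")

with open("rows.bin", "wb") as f:
    for row in [(1, b"alice", 9.5), (2, b"bob", 7.2)]:
        f.write(ROW.pack(*row))  # the whole row is written in one append

with open("rows.bin", "rb") as f:
    data = f.read()

# SELECT score FROM table: every full row is scanned to pull out one column.
scores = [ROW.unpack_from(data, i * ROW.size)[2]
          for i in range(len(data) // ROW.size)]
print(scores)
```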
Usually MLOps Engineers are professionals tasked with building out the ML Platform in the organization.
This means that the required skill set is very broad - naturally, very few people start off with the full set of skills you would need to brand yourself as an MLOps Engineer. This is why I would not choose this role if you are just entering the market.
You are very likely to run into a 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗦𝘆𝘀𝘁𝗲𝗺 𝗼𝗿 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 in your career. It could be 𝗦𝗽𝗮𝗿𝗸, 𝗛𝗶𝘃𝗲, 𝗣𝗿𝗲𝘀𝘁𝗼 or any other.
Also, it is very likely that these Frameworks will be reading data from distributed storage. It could be 𝗛𝗗𝗙𝗦, 𝗦𝟯 etc.
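For example, a minimal PySpark job reading from distributed storage (the s3a path and column name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Executors read their own slices of the files in parallel.
df = spark.read.parquet("s3a://my-bucket/events/")
df.groupBy("event_type").count().show()
```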
So how do we implement a 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗚𝗿𝗮𝗱𝗲 𝗕𝗮𝘁𝗰𝗵 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 in 𝗧𝗵𝗲 𝗠𝗟𝗢𝗽𝘀 𝗪𝗮𝘆?
𝟭: Everything starts in version control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
𝟮: Feature preprocessing stage: Features are retrieved from the Feature Store, validated and passed to the next stage. Any feature-related metadata is saved to an Experiment Tracking System.
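A sketch of stage 𝟮 in code; feature_store and tracker below are hypothetical stand-ins for whatever systems you run (e.g. Feast and MLflow), not real library APIs:

```python
import pandas as pd

def preprocess_features(feature_store, tracker) -> pd.DataFrame:
    """Stage 2: retrieve, validate and hand off features."""
    features = feature_store.get_training_features()  # hypothetical call
    # Simple validation gate before the features move to the next stage.
    assert features.notna().all().all(), "null values found in features"
    tracker.log_metadata({  # hypothetical call to the Experiment Tracking System
        "n_rows": len(features),
        "columns": list(features.columns),
    })
    return features
```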