Kafka is an extremely important Distributed Messaging System to understand, as it was the first of its kind and many newer products build on its ideas.
➡️ Clients writing to Kafka are called Producers.
➡️ Clients reading the data are called Consumers.
➡️ Data is written into Topics, which can be compared to Tables in Databases.
➡️ Messages sent to Topics are called Records.
➡️ Topics are composed of Partitions.
➡️ Each Partition is an append-only log and behaves like a Write-Ahead Log.
Writing Data:
➡️ There are two types of records that can be sent to a Topic: records containing a Key and records without a Key.
➡️ If there is no key, records are written to Partitions in a Round Robin fashion.
➡️ If there is a key, records with the same key will always be written to the Same Partition.
➡️ Data is always appended to the End of the Partition.
➡️ When written, a record is assigned an Offset, which denotes its Order/Place in the Partition.
➡️ Each Partition has its own separate set of Offsets, starting from 0.
➡️ Offsets are incremented sequentially as new records are written.
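The write path above can be sketched in a few lines of Python. This is a toy in-memory model, not real Kafka: the class and method names are illustrative, and Kafka's actual default partitioner uses murmur2 hashing, for which MD5 is a stand-in here.

```python
import hashlib
from itertools import count

class Topic:
    """Toy model of a Kafka topic: each partition is an append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._rr = count()  # round-robin counter for keyless records

    def _partition_for(self, key):
        if key is None:
            # No key: spread records across partitions round robin
            return next(self._rr) % len(self.partitions)
        # Same key always hashes to the same partition
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest, "big") % len(self.partitions)

    def send(self, value, key=None):
        partition = self._partition_for(key)
        log = self.partitions[partition]
        offset = len(log)  # offsets start at 0 and grow sequentially
        log.append((offset, key, value))
        return partition, offset

topic = Topic(num_partitions=3)
p1, o1 = topic.send("view:/home", key="user-42")
p2, o2 = topic.send("view:/pricing", key="user-42")
assert p1 == p2      # same key -> same partition
assert o2 == o1 + 1  # appended to the end, offset incremented
```

Note how ordering is only guaranteed within a partition: keyed records preserve per-key order, while keyless records are spread for throughput.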
I believe that the following is a correct order to start in Your Data Engineering Path:

1. Data Extraction
2. Data Validation
3. Data Contracts
4. Loading Data into a DWH / Data Lake
5. Transformations in a DWH / Data Lake
6. Scheduling
➡️ In row-oriented storage, rows on disk are stored in sequence.
➡️ New rows are written efficiently since the entire row can be written at once.
➡️ For SELECT statements that target a subset of columns, reading is slower since you need to scan every full row to retrieve just those columns.
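A minimal sketch of this trade-off, using a fixed-width binary row layout (the schema and helper names are made up for illustration):

```python
import struct

# Fixed-width row: id (int32), age (int32), score (float64)
ROW = struct.Struct("<iid")

def write_rows(rows):
    # Row store: each row is serialized whole and appended in sequence,
    # so a write touches one contiguous region
    return b"".join(ROW.pack(*row) for row in rows)

def read_column(buf, index):
    # To read one column we still have to step over every full row
    return [ROW.unpack_from(buf, offset)[index]
            for offset in range(0, len(buf), ROW.size)]

buf = write_rows([(1, 34, 9.5), (2, 28, 7.1), (3, 45, 8.8)])
ages = read_column(buf, 1)  # scans all rows just for the "age" column
```

A column store would lay the `age` values out contiguously instead, so this read would touch only the bytes it needs.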
We have covered lots of concepts around Kafka already. But what are the most common use cases for the system that you are very likely to run into as a Data Engineer?
➡️ Website Activity Tracking: the original use case for Kafka at LinkedIn.
➡️ Events happening on the website, like page views, conversions etc., are sent via a Gateway and piped to Kafka Topics.
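The gateway step can be sketched as a small router that turns site events into records on per-event-type topics. The topic naming scheme and event fields below are hypothetical; a real gateway would hand the serialized record to a Kafka producer instead of an in-memory dict.

```python
import json
from collections import defaultdict

class Gateway:
    """Toy gateway: buffers site events and routes them to per-type topics."""

    def __init__(self):
        self.topics = defaultdict(list)

    def handle(self, event):
        # Route by event type, e.g. site-events.page_view (assumed convention)
        topic = f"site-events.{event['type']}"
        self.topics[topic].append(json.dumps(event))  # serialize the record
        return topic

gw = Gateway()
gw.handle({"type": "page_view", "path": "/home", "user": "u1"})
gw.handle({"type": "conversion", "order_id": 123, "user": "u1"})
```

Keying such records by user id would keep each user's clickstream ordered within a partition, per the partitioning rules covered earlier.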
Usually MLOps Engineers are professionals tasked with building out the ML Platform in the organization.
This means that the skill set required is very broad - naturally, very few people start off with the full set of skills you would need to brand yourself as an MLOps Engineer. This is why I would not choose this role if you are just entering the market.
You are very likely to run into a Distributed Compute System or Framework in your career. It could be Spark, Hive, Presto or any other.
Also, it is very likely that these Frameworks will be reading data from distributed storage. It could be HDFS, S3 etc.
So how do we implement a Production Grade Batch Machine Learning Model Pipeline in The MLOps Way?
1: Everything starts in version control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
2: Feature preprocessing stage: Features are retrieved from the Feature Store, validated, and passed to the next stage. Any feature-related metadata is saved to an Experiment Tracking System.
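The preprocessing stage can be sketched as below. The feature store, the validation rule, and the tracker are all hypothetical stand-ins; a real pipeline would use a proper feature store client, a richer validation suite, and a tracking system such as MLflow.

```python
def fetch_features(store, entity_ids):
    # Stand-in for a feature-store lookup keyed by entity id
    return [store[entity] for entity in entity_ids]

def validate(features):
    # Toy rule: reject null values; real pipelines apply richer expectations
    if any(value is None for row in features for value in row.values()):
        raise ValueError("null feature value")
    return features

def preprocess_stage(store, entity_ids, tracker):
    features = validate(fetch_features(store, entity_ids))
    # Log feature-related metadata to the experiment-tracking system
    tracker.append({"rows": len(features), "columns": sorted(features[0])})
    return features

feature_store = {"u1": {"age": 34, "clicks": 12},
                 "u2": {"age": 28, "clicks": 3}}
run_log = []
batch = preprocess_stage(feature_store, ["u1", "u2"], run_log)
```

Failing fast at the validation step keeps bad features from ever reaching training, which is the point of placing this stage before the model code runs.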