Aurimas Griciūnas
Mar 17 • 8 tweets • 4 min read
What are the basics of Writing Data to a Kafka Topic?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Kafka is an extremely important 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺 to understand: it was one of the first of its kind, and many newer products are built on its ideas.

𝗦𝗼𝗺𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗱𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻𝘀:

👇
➡️ Clients writing to Kafka are called 𝗣𝗿𝗼𝗱𝘂𝗰𝗲𝗿𝘀.
➡️ Clients reading the Data are called 𝗖𝗼𝗻𝘀𝘂𝗺𝗲𝗿𝘀.
➡️ Data is written into 𝗧𝗼𝗽𝗶𝗰𝘀 that can be compared to 𝗧𝗮𝗯𝗹𝗲𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲𝘀.

👇
➡️ Messages sent to Topics are called 𝗥𝗲𝗰𝗼𝗿𝗱𝘀.
➡️ Topics are composed of 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀.
➡️ Each Partition behaves like a 𝗪𝗿𝗶𝘁𝗲 𝗔𝗵𝗲𝗮𝗱 𝗟𝗼𝗴 and is stored on disk as a set of log segment files.

👇
𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗗𝗮𝘁𝗮:

➡️ Records sent to a Topic come in two kinds: 𝗪𝗶𝘁𝗵 𝗮 𝗞𝗲𝘆 and 𝗪𝗶𝘁𝗵𝗼𝘂𝘁 𝗮 𝗞𝗲𝘆.
➡️ If there is no key, records are spread across Partitions in a 𝗥𝗼𝘂𝗻𝗱 𝗥𝗼𝗯𝗶𝗻 fashion. Since Kafka 2.4 the default partitioner sticks to one Partition per batch before switching, rather than rotating on every record.

👇
➡️ If there is a key, records with the same key are always written to the 𝗦𝗮𝗺𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻, as long as the number of Partitions in the Topic does not change.
➡️ Data is always written to the 𝗘𝗻𝗱 𝗼𝗳 𝘁𝗵𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻.
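
To make the keyed vs. keyless routing concrete, here is a minimal sketch in plain Python. It is not Kafka's actual partitioner: the class name `SketchPartitioner` is hypothetical, and `zlib.crc32` is only a stand-in for the murmur2 hash that Kafka's default partitioner uses. It just shows the idea that keyless records rotate over Partitions while records with the same key always land in the same one.

```python
import zlib
from itertools import count


class SketchPartitioner:
    """Toy partitioner for illustration only (hypothetical name)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()  # drives the rotation for keyless records

    def partition_for(self, key=None):
        if key is None:
            # No key: cycle through partitions 0, 1, 2, 0, 1, 2, ...
            return next(self._counter) % self.num_partitions
        # Key present: a deterministic hash maps equal keys to one partition.
        # crc32 is an assumption for this sketch; Kafka itself uses murmur2.
        return zlib.crc32(key) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
keyless = [partitioner.partition_for() for _ in range(6)]
keyed = [partitioner.partition_for(b"user-42") for _ in range(3)]
print(keyless)  # [0, 1, 2, 0, 1, 2]
print(keyed)    # three identical partition numbers
```

Note that the keyed result depends on `num_partitions`: change the partition count and the same key may map somewhere else.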

👇
➡️ When written, a record is assigned an 𝗢𝗳𝗳𝘀𝗲𝘁, which denotes its 𝗢𝗿𝗱𝗲𝗿/𝗣𝗹𝗮𝗰𝗲 𝗶𝗻 𝘁𝗵𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻.
➡️ Each Partition has its own sequence of Offsets, starting from 0.
➡️ Offsets are incremented sequentially when new records are written.
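
Offsets are easiest to see in a toy model. The sketch below is an assumption-laden simplification, not Kafka code: `PartitionLog` is a hypothetical in-memory class, and a Topic is modelled as a plain list of such logs. It shows append-only writes, zero-based offsets (as in Kafka), and the fact that every Partition counts its offsets independently.

```python
class PartitionLog:
    """Toy append-only log standing in for one Partition (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # New records always go to the end; the offset is the position
        # the record ends up at, so the first record gets offset 0.
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, offset):
        # Look a record up by its offset within this partition.
        return self._records[offset]


topic = [PartitionLog(), PartitionLog()]  # a Topic with two Partitions
offsets_p0 = [topic[0].append(v) for v in ("a", "b", "c")]
offsets_p1 = [topic[1].append(v) for v in ("x",)]
print(offsets_p0)  # [0, 1, 2]
print(offsets_p1)  # [0]
```

Because each Partition keeps its own counter, an offset only identifies a record together with its Partition.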

👇
👋 I am Aurimas.

I will help you Level Up in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space.

𝗙𝗼𝗹𝗹𝗼𝘄 𝗺𝗲 and hit 🔔

Join a growing community of 6500+ Data Enthusiasts by subscribing to my 𝗡𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿: newsletter.swirlai.com/p/sai-21-what-…

