Kafka is an extremely important Distributed Messaging System to understand, as it was the first of its kind and many newer products build on its ideas.
➡️ Clients writing to Kafka are called Producers.
➡️ Clients reading the data are called Consumers.
➡️ Data is written into Topics, which can be compared to Tables in Databases.
➡️ Messages sent to Topics are called Records.
➡️ Topics are composed of Partitions.
➡️ Each Partition is an append-only log and behaves like a Write-Ahead Log.
Writing Data:
➡️ There are two types of records that can be sent to a Topic: records containing a Key and records without a Key.
➡️ If there is no key, records are written to Partitions in a Round Robin fashion.
➡️ If there is a key, records with the same key will always be written to the Same Partition.
➡️ Data is always appended to the End of the Partition.
➡️ When written, a record is assigned an Offset, which denotes its Order/Place in the Partition.
➡️ Each Partition has its own separate set of Offsets, starting from 0.
➡️ Offsets are incremented sequentially as new records are written.
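The write path above can be sketched in a few lines of Python. This is a toy in-memory model, not real Kafka: the class and method names are illustrative, and Kafka's actual default partitioner uses murmur2 hashing, for which MD5 is a stand-in here.

```python
import hashlib
from itertools import count

class Topic:
    """Toy model of a Kafka topic: each partition is an append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._rr = count()  # round-robin counter for keyless records

    def _partition_for(self, key):
        if key is None:
            # No key: spread records across partitions round robin
            return next(self._rr) % len(self.partitions)
        # Same key always hashes to the same partition
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest, "big") % len(self.partitions)

    def send(self, value, key=None):
        partition = self._partition_for(key)
        log = self.partitions[partition]
        offset = len(log)  # offsets start at 0 and grow sequentially
        log.append((offset, key, value))
        return partition, offset

topic = Topic(num_partitions=3)
p1, o1 = topic.send("view:/home", key="user-42")
p2, o2 = topic.send("view:/pricing", key="user-42")
assert p1 == p2      # same key -> same partition
assert o2 == o1 + 1  # appended to the end, offset incremented
```

Note how ordering is only guaranteed within a partition: keyed records preserve per-key order, while keyless records are spread for throughput.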
I believe that the following is a correct order to start in Your Data Engineering Path:

1. Data Extraction
2. Data Validation
3. Data Contracts
4. Loading Data into a DWH / Data Lake
5. Transformations in a DWH / Data Lake
6. Scheduling
➡️ In row-oriented storage, rows on disk are stored in sequence.
➡️ New rows are written efficiently since the entire row can be written at once.
➡️ For SELECT statements that target a subset of columns, reading is slower since you need to scan every full row to retrieve just those columns.
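A minimal sketch of this trade-off, using a fixed-width binary row layout (the schema and helper names are made up for illustration):

```python
import struct

# Fixed-width row: id (int32), age (int32), score (float64)
ROW = struct.Struct("<iid")

def write_rows(rows):
    # Row store: each row is serialized whole and appended in sequence,
    # so a write touches one contiguous region
    return b"".join(ROW.pack(*row) for row in rows)

def read_column(buf, index):
    # To read one column we still have to step over every full row
    return [ROW.unpack_from(buf, offset)[index]
            for offset in range(0, len(buf), ROW.size)]

buf = write_rows([(1, 34, 9.5), (2, 28, 7.1), (3, 45, 8.8)])
ages = read_column(buf, 1)  # scans all rows just for the "age" column
```

A column store would lay the `age` values out contiguously instead, so this read would touch only the bytes it needs.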
We have covered lots of concepts around Kafka already. But what are the most common use cases for the system that you are very likely to run into as a Data Engineer?
➡️ Website Activity Tracking: the original use case for Kafka at LinkedIn.
➡️ Events happening on the website, like page views, conversions etc., are sent via a Gateway and piped to Kafka Topics.
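The gateway step can be sketched as a small router that turns site events into records on per-event-type topics. The topic naming scheme and event fields below are hypothetical; a real gateway would hand the serialized record to a Kafka producer instead of an in-memory dict.

```python
import json
from collections import defaultdict

class Gateway:
    """Toy gateway: buffers site events and routes them to per-type topics."""

    def __init__(self):
        self.topics = defaultdict(list)

    def handle(self, event):
        # Route by event type, e.g. site-events.page_view (assumed convention)
        topic = f"site-events.{event['type']}"
        self.topics[topic].append(json.dumps(event))  # serialize the record
        return topic

gw = Gateway()
gw.handle({"type": "page_view", "path": "/home", "user": "u1"})
gw.handle({"type": "conversion", "order_id": 123, "user": "u1"})
```

Keying such records by user id would keep each user's clickstream ordered within a partition, per the partitioning rules covered earlier.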
Usually MLOps Engineers are professionals tasked with building out the ML Platform in the organization.
This means that the skill set required is very broad - naturally, very few people start off with the full set of skills you would need to brand yourself as an MLOps Engineer. This is why I would not choose this role if you are just entering the market.
You are very likely to run into a Distributed Compute System or Framework in your career. It could be Spark, Hive, Presto or any other.
Also, it is very likely that these Frameworks will be reading data from distributed storage. It could be HDFS, S3 etc.
So how do we implement a Production Grade Batch Machine Learning Model Pipeline in The MLOps Way?
1: Everything starts in version control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
2: Feature preprocessing stage: Features are retrieved from the Feature Store, validated, and passed to the next stage. Any feature-related metadata is saved to an Experiment Tracking System.
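The preprocessing stage can be sketched as below. The feature store, the validation rule, and the tracker are all hypothetical stand-ins; a real pipeline would use a proper feature store client, a richer validation suite, and a tracking system such as MLflow.

```python
def fetch_features(store, entity_ids):
    # Stand-in for a feature-store lookup keyed by entity id
    return [store[entity] for entity in entity_ids]

def validate(features):
    # Toy rule: reject null values; real pipelines apply richer expectations
    if any(value is None for row in features for value in row.values()):
        raise ValueError("null feature value")
    return features

def preprocess_stage(store, entity_ids, tracker):
    features = validate(fetch_features(store, entity_ids))
    # Log feature-related metadata to the experiment-tracking system
    tracker.append({"rows": len(features), "columns": sorted(features[0])})
    return features

feature_store = {"u1": {"age": 34, "clicks": 12},
                 "u2": {"age": 28, "clicks": 3}}
run_log = []
batch = preprocess_stage(feature_store, ["u1", "u2"], run_log)
```

Failing fast at the validation step keeps bad features from ever reaching training, which is the point of placing this stage before the model code runs.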