We have covered lots of concepts around Kafka already. But what are the most common Kafka use cases that you are very likely to run into as a Data Engineer?
➡️ The original use case for Kafka at LinkedIn - tracking website activity.
➡️ Events happening on the website, like page views and conversions, are sent via a Gateway and piped to Kafka Topics.
➡️ These events are forwarded to downstream Analytical systems or processed in Real Time.
➡️ Kafka is used as an initial buffer since the Data volumes are usually big, and Kafka guarantees no message loss due to its replication mechanisms.
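To make this concrete, here is a minimal sketch of what the Gateway side could look like in Python, assuming the confluent-kafka client and a hypothetical "page-views" Topic - not the exact LinkedIn setup:

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def track_event(user_id: str, event_type: str, page: str) -> None:
    """Serialize a website event and pipe it to the Kafka Topic."""
    event = {"user_id": user_id, "event_type": event_type, "page": page}
    # Keying by user_id keeps each user's events ordered within one partition.
    producer.produce("page-views", key=user_id, value=json.dumps(event))

track_event("u-42", "page_view", "/pricing")
producer.flush()  # block until buffered events are delivered to the brokers
```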
➡️ The Database commit log is piped to a Kafka topic.
➡️ The committed changes are replayed against a new Database in the same order.
➡️ The result is a full Database replica.
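A sketch of the replica-building side, assuming plain JSON change events on a hypothetical "db-commit-log" Topic; real deployments typically use a CDC tool such as Debezium with a richer event schema:

```python
import json
import sqlite3

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replica-builder",
    "auto.offset.reset": "earliest",  # replay the commit log from the start
})
consumer.subscribe(["db-commit-log"])

replica = sqlite3.connect("replica.db")
replica.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    change = json.loads(msg.value())
    # Applying changes in commit-log order reproduces the source Database state.
    replica.execute(change["sql"], change.get("params", []))
    replica.commit()
```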
➡️ Kafka is used for centralized Log and Metrics collection.
➡️ Daemons like FluentD are deployed on servers or in containers together with the Applications to be monitored.
➡️ Applications send their Logs/Metrics to the Daemons.
➡️ The Daemons pipe Logs/Metrics to a Kafka Topic.
➡️ Logs/Metrics are delivered downstream to storage systems like ElasticSearch (for Log discovery) or InfluxDB (for Metrics).
➡️ This is also how you would track your IoT Fleets.
➡️ This is usually coupled with the ingestion mechanisms already covered.
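As an illustration, a sketch of the last hop - a Consumer shipping log records into ElasticSearch. It assumes the elasticsearch-py 8.x client and a hypothetical "app-logs" Topic:

```python
import json

from confluent_kafka import Consumer
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "log-indexer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["app-logs"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Index each record so it becomes searchable for Log discovery.
    es.index(index="app-logs", document=record)
```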
➡️ Instead of piping Data to a certain storage downstream, we mount a Stream Processing Framework on top of Kafka Topics.
➡️ The Data is filtered, enriched and then piped to the downstream systems to be used according to the use case.
➡️ This is also where one would run Machine Learning Models embedded into a Stream Processing Application.
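A toy filter-and-enrich loop to show the shape of it, assuming hypothetical "page-views" (input) and "enriched-page-views" (output) Topics; dedicated frameworks like Kafka Streams or Flink add state, windowing and fault tolerance on top of this pattern:

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
})
consumer.subscribe(["page-views"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("event_type") != "page_view":
        continue  # filter: drop everything that is not a page view
    event["country"] = "US"  # enrich: stand-in for e.g. a GeoIP lookup
    producer.produce("enriched-page-views", value=json.dumps(event))
```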
𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴.
➡️ Kafka can be used as a replacement for more traditional messaging brokers like RabbitMQ.
➡️ Kafka has better durability guarantees and makes it easy for several separate Consumer Groups to consume from the same Topic.
⚠️ Having said this - always consider the complexity you are bringing in with the introduction of a Distributed System. Sometimes it is better to just use traditional frameworks.
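The fan-out property is just a matter of the group.id setting - a sketch, assuming a hypothetical "orders" Topic:

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,  # offsets are tracked independently per group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])  # hypothetical Topic name
    return consumer

# Two separate Consumer Groups - each receives every message on the Topic.
billing = make_consumer("billing-service")
analytics = make_consumer("analytics-service")
```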
I believe that the following is a correct order to start in 𝗬𝗼𝘂𝗿 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗣𝗮𝘁𝗵:
➡️ Data Extraction
➡️ Data Validation
➡️ Data Contracts
➡️ Loading Data into a DWH / Data Lake
➡️ Transformations in a DWH / Data Lake
➡️ Scheduling
Kafka is an extremely important 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺 to understand, as it was the first of its kind and most newer products are built on its ideas.
➡️ Clients writing to Kafka are called 𝗣𝗿𝗼𝗱𝘂𝗰𝗲𝗿𝘀,
➡️ Clients reading the Data are called 𝗖𝗼𝗻𝘀𝘂𝗺𝗲𝗿𝘀.
➡️ Data is written into 𝗧𝗼𝗽𝗶𝗰𝘀 that can be compared to 𝗧𝗮𝗯𝗹𝗲𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲𝘀.
➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently since the entire row can be written at once.
➡️ For select statements that target a subset of columns, reading is slower since you need to scan entire rows to extract just the columns you need.
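A toy illustration of the trade-off with fixed-size rows packed one after another (the layout is deliberately simplified; real engines add pages, headers and indexes):

```python
import struct

# One row: id (int), name (20 bytes), score (float) - stored contiguously.
ROW = struct.Struct("i20sf")

with open("rows.bin", "wb") as f:
    for row in [(1, b"alice", 9.5), (2, b"bob", 7.2)]:
        f.write(ROW.pack(*row))  # the whole row is written in one append

with open("rows.bin", "rb") as f:
    data = f.read()

# SELECT score FROM table: every full row is scanned to pull out one column.
scores = [ROW.unpack_from(data, i * ROW.size)[2]
          for i in range(len(data) // ROW.size)]
print(scores)
```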
Usually MLOps Engineers are professionals tasked with building out the ML Platform in the organization.
This means that the required skill set is very broad - naturally, very few people start off with the full set of skills you would need to brand yourself as an MLOps Engineer. This is why I would not choose this role if you are just entering the market.
You are very likely to run into a 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗦𝘆𝘀𝘁𝗲𝗺 𝗼𝗿 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 in your career. It could be 𝗦𝗽𝗮𝗿𝗸, 𝗛𝗶𝘃𝗲, 𝗣𝗿𝗲𝘀𝘁𝗼 or any other.
Also, it is very likely that these Frameworks will be reading data from distributed storage. It could be 𝗛𝗗𝗙𝗦, 𝗦𝟯 etc.
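For example, a minimal PySpark job reading from distributed storage (the s3a path and column name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Executors read their own slices of the files in parallel.
df = spark.read.parquet("s3a://my-bucket/events/")
df.groupBy("event_type").count().show()
```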
So how do we implement a 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗚𝗿𝗮𝗱𝗲 𝗕𝗮𝘁𝗰𝗵 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 in 𝗧𝗵𝗲 𝗠𝗟𝗢𝗽𝘀 𝗪𝗮𝘆?
𝟭: Everything starts in version control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
𝟮: Feature preprocessing stage: Features are retrieved from the Feature Store, validated and passed to the next stage. Any feature-related metadata is saved to an Experiment Tracking System.
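A sketch of stage 𝟮 in code; feature_store and tracker below are hypothetical stand-ins for whatever systems you run (e.g. Feast and MLflow), not real library APIs:

```python
import pandas as pd

def preprocess_features(feature_store, tracker) -> pd.DataFrame:
    """Stage 2: retrieve, validate and hand off features."""
    features = feature_store.get_training_features()  # hypothetical call
    # Simple validation gate before the features move to the next stage.
    assert features.notna().all().all(), "null values found in features"
    tracker.log_metadata({  # hypothetical call to the Experiment Tracking System
        "n_rows": len(features),
        "columns": list(features.columns),
    })
    return features
```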