I believe that the following is a correct order to start on your Data Engineering Path:
👉 Data Extraction
👉 Data Validation
👉 Data Contracts
👉 Loading Data into a DWH / Data Lake
👉 Transformations in a DWH / Data Lake
👉 Scheduling
👉 Airflow - yes, Airflow. There are many who say that focusing only on Airflow as a scheduler is a narrow-minded approach. Well, you will find Airflow in 99% of job ads - start with it, forget what people say.
👉 Spark
👉 dbt
Kafka is an extremely important Distributed Messaging System to understand, as it was the first of its kind and most of the newer products are built on the ideas behind Kafka.
➡️ Clients writing to Kafka are called Producers.
➡️ Clients reading the data are called Consumers.
➡️ Data is written into Topics, which can be compared to tables in databases.
➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently, since the entire row can be written at once.
👇
➡️ For SELECT statements that target a subset of columns, reading is slower, since you still need to scan entire rows to retrieve just the selected columns.
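The read/write trade-off above can be illustrated with a toy sketch. This is not any specific database engine, just rows packed as fixed-width fields into one sequential byte buffer:

```python
import struct

# Toy row-oriented storage: each row is three fixed-width int32 columns,
# appended to a single sequential byte buffer (our pretend "disk").
ROW_FORMAT = "iii"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 12 bytes per row

storage = bytearray()

def write_row(a, b, c):
    # Writing is cheap: the whole row is appended in one sequential write.
    storage.extend(struct.pack(ROW_FORMAT, a, b, c))

def read_column(col_index):
    # Reading one column is not cheap: we still step through every
    # full row on "disk" to pick out a single field.
    values = []
    for offset in range(0, len(storage), ROW_SIZE):
        row = struct.unpack_from(ROW_FORMAT, storage, offset)
        values.append(row[col_index])
    return values, len(storage)  # bytes scanned == the entire table

for i in range(5):
    write_row(i, i * 10, i * 100)

col, bytes_scanned = read_column(1)
print(col)            # [0, 10, 20, 30, 40]
print(bytes_scanned)  # 60 -> all 5 rows * 12 bytes, even for one column
```

A column-oriented layout would instead store each column contiguously, so the same projection would touch only that column's bytes.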
We have covered many concepts around Kafka already. But what are the most common use cases for the system that you are very likely to run into as a Data Engineer?
➡️ The original use case for Kafka at LinkedIn:
➡️ Events happening on the website, like page views, conversions etc., are sent via a gateway and piped into Kafka topics.
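The Producer/Topic/Consumer model described above can be sketched as a toy in-memory simulation. This is NOT the real Kafka client API (a real setup would use a client library such as kafka-python or confluent-kafka against a running broker); the topic name "page_views" is a made-up stand-in for the LinkedIn-style event stream:

```python
from collections import defaultdict

class Broker:
    """Holds append-only logs, one per topic (like tables in a database)."""
    def __init__(self):
        self.topics = defaultdict(list)

class Producer:
    """Clients writing to Kafka are called Producers."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        self.broker.topics[topic].append(message)

class Consumer:
    """Clients reading the data are called Consumers.
    Each consumer tracks its own read position (offset) in the log."""
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        log = self.broker.topics[self.topic]
        messages = log[self.offset:]
        self.offset = len(log)
        return messages

broker = Broker()
producer = Producer(broker)
producer.send("page_views", {"user": 1, "page": "/home"})
producer.send("page_views", {"user": 2, "page": "/jobs"})

consumer = Consumer(broker, "page_views")
print(len(consumer.poll()))  # 2 - reads both events
print(len(consumer.poll()))  # 0 - nothing new since the last poll
```

The key idea carried over from real Kafka: the broker keeps an immutable log, and consumers advance through it independently at their own offsets.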
Usually MLOps Engineers are professionals tasked with building out the ML Platform in the organization.
👇
This means that the required skill set is very broad - naturally, very few people start off with the full set of skills you would need to brand yourself as an MLOps Engineer. This is why I would not choose this role if you are just entering the market.
You are very likely to run into a Distributed Compute System or Framework in your career. It could be Spark, Hive, Presto or any other.
👇
Also, it is very likely that these frameworks will be reading data from distributed storage. It could be HDFS, S3 etc.
So how do we implement a Production Grade Batch Machine Learning Model Pipeline in The MLOps Way?
1: Everything starts in version control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
👇
2: Feature preprocessing stage: features are retrieved from the Feature Store, validated and passed to the next stage. Any feature-related metadata is saved to an Experiment Tracking System.
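The feature preprocessing stage above can be sketched in minimal Python. Everything here is a hypothetical stand-in, not a real product API: `FeatureStore`, `validate_features` and `ExperimentTracker` are assumed names (real systems would be e.g. Feast for the store and MLflow for tracking, with different interfaces):

```python
class FeatureStore:
    """Hypothetical stand-in for a real feature store."""
    def get_features(self, names):
        # Pretend retrieval: return a dummy feature vector per name.
        return {name: [1.0, 2.0, 3.0] for name in names}

class ExperimentTracker:
    """Hypothetical stand-in for an experiment tracking system."""
    def __init__(self):
        self.metadata = {}

    def log(self, key, value):
        self.metadata[key] = value

def validate_features(features):
    # Minimal validation: every feature vector must be non-empty
    # and contain no missing values.
    for name, values in features.items():
        if not values or any(v is None for v in values):
            raise ValueError(f"invalid feature: {name}")
    return features

def preprocessing_stage(store, tracker, feature_names):
    # Retrieve -> validate -> log metadata -> hand off to the next stage.
    features = validate_features(store.get_features(feature_names))
    tracker.log("feature_names", sorted(features))
    tracker.log("row_counts", {n: len(v) for n, v in features.items()})
    return features  # passed to the next stage (e.g. training)

tracker = ExperimentTracker()
features = preprocessing_stage(FeatureStore(), tracker, ["clicks", "spend"])
print(tracker.metadata["feature_names"])  # ['clicks', 'spend']
```

The point of the shape, not the names: the stage only touches the store and the tracker through narrow interfaces, so either backend can be swapped without changing the pipeline code.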