1️⃣ "Fundamentals of Data Engineering" - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.

2️⃣ "Accelerate" - Data Engineers should follow the same practices that Software Engineers do, and more. After reading this book you will understand DevOps practices inside and out.

3️⃣ "Designing Data-Intensive Applications" - Delve deeper into Data Engineering Fundamentals. After reading this book you will understand Storage Formats, Distributed Technologies, Distributed Consensus algorithms and more.

4️⃣ "Team Topologies" - Sometimes you might get confused about why a certain communication pattern is in place in the company you work for. After reading this book you will learn the Team Topologies model of organizational structure for fast flow.

5️⃣ "Data Mesh" - Data Mesh has become an extremely popular buzzword in recent years. By reading this book you will understand the intent of the author who coined the term. Don't be the one to throw the term around without understanding its meaning deeply.
[NOTE]: All of the books above cover Fundamental concepts. Even if you read all of them and decide that Data Engineering is not for you, you will be able to reuse the knowledge in any other Tech Role.

[ADDITIONAL NOTE]: I did not include any Fundamental Classics in the list, as these can be picked up after you have already established yourself in the role.
A Feature Store System sits between Data Engineering and Machine Learning Pipelines and solves the following issues:

➡️ Eliminates Training/Serving skew by syncing the Batch and Online Serving Storages (5).

➡️ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog, and then serve Feature Sets for training and inference purposes through a unified interface (4, 3) - see the sketch after this list.
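To make the "define once, serve twice" idea concrete, here is a minimal, hypothetical Python sketch of such a unified interface. The post does not name a specific Feature Store product, so every name here (FeatureView, FeatureStore, get_historical_features, get_online_features) is illustrative rather than a real library API.

```python
# Hypothetical sketch of a unified Feature Store interface.
# All class and method names are illustrative, not a real library's API.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class FeatureView:
    """A named set of features produced by one transformation, keyed by an entity."""
    name: str
    entity_key: str                          # e.g. "user_id"
    transformation: Callable[[dict], dict]   # raw record -> feature dict


class FeatureStore:
    def __init__(self) -> None:
        self.catalog: Dict[str, FeatureView] = {}     # metadata layer: discoverability
        self.offline: Dict[str, List[dict]] = {}      # batch storage (training)
        self.online: Dict[str, Dict[Any, dict]] = {}  # low-latency storage (inference)

    def register(self, view: FeatureView) -> None:
        """Define the transformation once and make it discoverable via the catalog."""
        self.catalog[view.name] = view
        self.offline[view.name] = []
        self.online[view.name] = {}

    def ingest(self, view_name: str, raw_records: List[dict]) -> None:
        """Materialize features into BOTH storages from the same definition -
        this is what removes Training/Serving skew."""
        view = self.catalog[view_name]
        for record in raw_records:
            features = view.transformation(record)
            self.offline[view_name].append(features)                       # training
            self.online[view_name][features[view.entity_key]] = features   # serving

    def get_historical_features(self, view_name: str) -> List[dict]:
        """Offline retrieval for building training datasets."""
        return list(self.offline[view_name])

    def get_online_features(self, view_name: str, entity_id: Any) -> dict:
        """Online retrieval for real-time inference."""
        return self.online[view_name][entity_id]


# Usage: the same transformation feeds both training and serving.
store = FeatureStore()
store.register(FeatureView(
    name="user_purchase_stats",
    entity_key="user_id",
    transformation=lambda r: {"user_id": r["user_id"],
                              "total_spend": sum(r["purchases"])},
))
store.ingest("user_purchase_stats", [{"user_id": 1, "purchases": [10.0, 25.5]}])

training_rows = store.get_historical_features("user_purchase_stats")
serving_row = store.get_online_features("user_purchase_stats", entity_id=1)
print(training_rows, serving_row)
```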
Change Data Capture (CDC) is a software process used to replicate actions performed against Operational Databases for use in downstream applications.

➡️ Database Replication (refer to 3️⃣ in the Diagram).
👉 CDC can be used to move transactions performed against a Source Database to a Target Database. If each transaction is replicated, it is possible to retain all ACID guarantees when performing replication - a sketch of this is shown below.
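As an illustration of the "replicate whole transactions" point, here is a minimal Python sketch that applies captured change events to a target database. The event shape, the users table and the sqlite3 target are assumptions made for the example; a real setup would use a log-based capture tool and your actual source and target databases.

```python
# Minimal sketch of applying CDC events to a target database, assuming the
# capture side already emits ordered change events grouped by source transaction.
# The event format and the single "users" table are illustrative only.
import sqlite3
from typing import Any, Dict, List

ChangeEvent = Dict[str, Any]  # {"op": "insert"|"update"|"delete", "table": ..., "key": ..., "row": ...}


def apply_transaction(target: sqlite3.Connection, events: List[ChangeEvent]) -> None:
    """Apply one source transaction to the target atomically.
    Replaying whole transactions (not individual rows) is what lets the
    target retain the source's ACID guarantees."""
    with target:  # opens a transaction; commits on success, rolls back on error
        for e in events:
            if e["op"] == "insert":
                target.execute(
                    "INSERT INTO users (id, email) VALUES (?, ?)",
                    (e["row"]["id"], e["row"]["email"]),
                )
            elif e["op"] == "update":
                target.execute(
                    "UPDATE users SET email = ? WHERE id = ?",
                    (e["row"]["email"], e["key"]),
                )
            elif e["op"] == "delete":
                target.execute("DELETE FROM users WHERE id = ?", (e["key"],))


# Usage with an in-memory target DB and one example source transaction.
target_db = sqlite3.connect(":memory:")
target_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

source_transaction = [
    {"op": "insert", "table": "users", "key": 1, "row": {"id": 1, "email": "a@example.com"}},
    {"op": "update", "table": "users", "key": 1, "row": {"id": 1, "email": "b@example.com"}},
]
apply_transaction(target_db, source_transaction)
print(target_db.execute("SELECT * FROM users").fetchall())  # [(1, 'b@example.com')]
```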
An ML Metadata Store should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity of your company.

If you are at the beginning of your ML journey you might be:

1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside your Notebook and do that manually at each retraining.

If you are beyond Notebooks, you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator triggers. Either way, what gets logged is sketched below.
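Whichever of the two setups you are in, the metadata lands in the same two components. Below is a minimal sketch using MLflow as one possible backend (the text above does not prescribe a tool); the tracking URI, experiment name, parameters, metrics and model are placeholders, and the Model Registry step assumes your tracking backend supports a registry.

```python
# Minimal sketch of logging ML Pipeline metadata, using MLflow as an example.
# Assumptions: mlflow and scikit-learn are installed, and the tracking backend
# supports the Model Registry (e.g. an MLflow Tracking Server with a DB store).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("http://localhost:5000")  # placeholder tracking server URI;
mlflow.set_experiment("churn-prediction")         # the same call works from a Notebook or CI/CD

X, y = make_classification(n_samples=500, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)

# Experiment Tracking: every retraining (manual or orchestrator-triggered)
# logs the run's parameters and metrics.
with mlflow.start_run():
    model.fit(X, y)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Model Registry: promote the trained model under a stable name so that
    # serving references a registered version, not a notebook-local file.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-classifier",
    )
```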