Apache Spark is an extremely popular distributed processing framework that uses in-memory processing to speed up task execution. Most of its libraries are built on top of the Spark Core layer.
As a warm-up exercise for later deeper dives and tips, today we focus on some architecture basics.
1️⃣ Once you submit a Spark Application, a SparkContext Object is created in the Driver Program. This Object is responsible for communicating with the Cluster Manager.
2️⃣ The SparkContext negotiates with the Cluster Manager for the resources required to run the Spark Application. The Cluster Manager allocates the resources inside the respective Cluster and creates the requested number of Spark Executors.
3️⃣ After starting, the Spark Executors connect to the SparkContext to notify it that they have joined the Cluster. Executors regularly send heartbeats to notify the Driver Program that they are healthy and don't need rescheduling.
4️⃣ Spark Executors are responsible for executing the tasks of the Computation DAG (Directed Acyclic Graph). This could include reading or writing data, or performing an operation on a partition of RDDs.
👉 The Spark Driver is responsible for constructing an optimized physical execution plan for a given application submitted for execution.
👉 This plan materializes into a Job, which is a DAG of Stages.
👉 Some of the Stages can be executed in parallel if they have no sequential dependencies.
👉 Each Stage is composed of Tasks.
👉 All Tasks of a single Stage contain the same type of work; a Task is the smallest piece of work that can be executed in parallel, and it is performed by the Spark Executors.
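To make the Job/Stage/Task breakdown above concrete, here is a minimal PySpark sketch (the application name, data, and partition count are made up for illustration). The reduceByKey introduces a shuffle boundary, so the single triggered Job splits into two Stages, each run as parallel Tasks over the partitions:

```python
from pyspark.sql import SparkSession

# Creating a SparkSession also creates the SparkContext in the Driver Program;
# behind the scenes it negotiates Executors with the Cluster Manager.
spark = SparkSession.builder.appName("architecture-demo").getOrCreate()
sc = spark.sparkContext

# Transformations only build up the Computation DAG - nothing executes yet.
rdd = sc.parallelize(range(1_000_000), numSlices=8)  # 8 partitions -> 8 Tasks per Stage
pairs = rdd.map(lambda x: (x % 10, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)       # shuffle boundary -> second Stage

# The action triggers a Job: a DAG of two Stages (map side and reduce side),
# each Stage executed as parallel Tasks on the Spark Executors.
print(counts.collect())

spark.stop()
```

Submitting this file through spark-submit is exactly the "submit a Spark Application" step from point 1️⃣, and the Spark UI will show the two Stages and their Tasks.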
In its simplest form, a Data Contract is an agreement between Data Producers and Data Consumers on what the data being produced should look like, what SLAs it should meet, and what its semantics are.
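As a rough sketch of the idea (every field name, type, and threshold below is invented for illustration), such a contract can be expressed and enforced in code before data is published:

```python
# A toy Data Contract: schema, semantics, and an SLA agreed between
# producer and consumer. All names and values are illustrative.
CONTRACT = {
    "schema": {"user_id": int, "event_type": str, "event_ts": float},
    "semantics": {"event_type": {"click", "purchase", "page_view"}},
    "sla": {"max_delay_seconds": 300},
}

def violations(record: dict, now_ts: float) -> list[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for field, expected_type in CONTRACT["schema"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    if record.get("event_type") not in CONTRACT["semantics"]["event_type"]:
        errors.append("unknown event_type")
    if now_ts - record.get("event_ts", now_ts) > CONTRACT["sla"]["max_delay_seconds"]:
        errors.append("SLA violated: record is too stale")
    return errors

print(violations({"user_id": 1, "event_type": "click", "event_ts": 100.0}, now_ts=150.0))  # []
```

The producer runs these checks before publishing, and the consumer can run the very same checks on arrival - which is what makes it a contract rather than documentation.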
What does a Real Time Search or Recommender System Design look like?
The graph was inspired by the amazing work of @eugeneyan
Recommender and Search Systems are among the biggest money makers for most companies when it comes to Machine Learning.
Both Systems are inherently similar: their goal is to return a list of recommended items given a certain context - it could be a search query on an e-commerce website, or a list of recommended songs given that you are currently listening to a certain song on Spotify.
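A widely used shape for both is a two-stage retrieve-then-rank pipeline: a cheap retrieval step narrows the full catalog down to a handful of candidates, then a more expensive ranking step orders them. Here is a toy sketch, with plain dot-product scoring standing in for what would really be an ANN index and a learned ranking model:

```python
# Toy catalog: item -> embedding. Real systems keep these in an ANN index
# (e.g. FAISS) so retrieval stays fast over millions of items.
CATALOG = {
    "song_a": [0.9, 0.1], "song_b": [0.8, 0.2],
    "song_c": [0.1, 0.9], "song_d": [0.2, 0.8],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(context_emb, k=3):
    """Stage 1: cheap candidate generation over the whole catalog."""
    return sorted(CATALOG, key=lambda i: dot(CATALOG[i], context_emb), reverse=True)[:k]

def rank(candidates, context_emb):
    """Stage 2: re-score only the few candidates. Here the 'model' just
    reuses similarity; a real system would call a learned ranker."""
    return sorted(candidates, key=lambda i: dot(CATALOG[i], context_emb), reverse=True)

# Context: an embedding of the song currently playing (or of a search query).
context = [0.85, 0.15]
print(rank(retrieve(context), context))
```

The same skeleton serves both use cases: for search, the context embedding comes from the query; for recommendations, from the user's current session.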
Here is a short refresher on the ACID Properties of a DBMS (Database Management System).
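As a tiny illustration of the A (Atomicity), here is a minimal sqlite3 sketch (the table and amounts are made up): the transaction either commits in full or rolls back in full:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity: both UPDATEs succeed together or not at all.
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        balance = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
        if balance < 0:
            raise ValueError("insufficient funds")  # forces the rollback
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Balances are unchanged - the partial UPDATE was rolled back.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```

Consistency, Isolation, and Durability are enforced by the engine itself: constraints hold before and after every transaction, concurrent transactions don't see each other's partial work, and committed data survives a crash.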
No Excuses Data Engineering Portfolio Template - next week I will enrich it with the missing Machine Learning and MLOps parts!
Lambda and Kappa are both Data Architectures proposed to solve the movement of large amounts of data for reliable Online access.
The most popular architecture has been, and continues to be, Lambda. However, with Stream Processing becoming more accessible to organizations of every size, you will be hearing a lot more about Kappa in the near future. Let's see how they differ.
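To make the contrast concrete: in Lambda, a serving layer merges a complete-but-stale batch view with a fresh-but-partial real-time view; Kappa drops the batch path and treats backfills as replays of the event log through the same streaming code. A toy sketch of the Lambda-side merge (all names and numbers invented):

```python
# Lambda architecture in miniature: the same events flow through two paths.

# Batch layer: recomputed on a schedule - complete, but hours stale.
batch_view = {"user_1": 40, "user_2": 7}      # e.g. click counts as of the last batch run

# Speed layer: incremental - covers only events since the last batch run.
realtime_view = {"user_1": 2, "user_3": 1}

def serve(user_id: str) -> int:
    """Serving layer: merge both views at query time."""
    return batch_view.get(user_id, 0) + realtime_view.get(user_id, 0)

print(serve("user_1"))  # 42

# Kappa removes the batch path entirely: a single stream processor maintains
# the view, and reprocessing means replaying the log through that same code.
```

The price of Lambda is maintaining two codebases that must agree; the price of Kappa is that your stream processor and log retention must be good enough to handle full replays.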
Let's remind ourselves of what a Request-Response Model Deployment looks like - The MLOps Way.
You will find this type of model deployment to be the most popular when it comes to Online Machine Learning Systems.
Let's zoom in:
1: Version Control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.
2: Feature Preprocessing: Features are retrieved from the Feature Store, validated, and passed to the next stage. Any feature-related metadata that is tightly coupled to the model being trained is saved to the Experiment Tracking System.
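Eventually the trained model has to answer the actual requests, and that serving side is typically a thin HTTP service in front of the model. A minimal sketch using FastAPI, where the endpoint path, payload fields, and model loader are all assumptions for illustration:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    user_id: str  # illustrative request payload

def load_model():
    """Placeholder: in a real system the artifact would come from the
    model registry that the training pipeline publishes to."""
    return lambda features: {"score": 0.42}

model = load_model()

@app.post("/predict")  # hypothetical endpoint path
def predict(request: PredictRequest):
    # In production, features would be fetched from the online Feature Store
    # keyed by request.user_id, so serving matches training preprocessing.
    features = {"user_id": request.user_id}
    return model(features)

# Run with: uvicorn app:app --reload  (assuming this file is saved as app.py)
```

The caller sends a request, waits, and gets a prediction back synchronously - which is exactly what makes this deployment pattern "request-response".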