Aurimas Griciūnas
Feb 23 • 15 tweets • 3 min read
Do you know how Apache Spark is Architected?

Find out in the 🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience

Apache Spark is an extremely popular distributed processing framework that uses in-memory processing to speed up task execution. Its core functionality lives in the Spark Core layer, and the higher-level libraries are built on top of it.

👇
As a warm-up for later deep dives and tips, today we focus on some architecture basics.

Spark has several high-level APIs built on top of Spark Core to support different use cases:

➡️ Spark SQL - Batch Processing.
➡️ Spark Streaming - Near Real-Time Processing.
➡️ Spark MLlib - Machine Learning.
➡️ GraphX - Graph Structures and Algorithms.

👇
Supported programming languages:

➡️ Scala
➡️ Java
➡️ Python
➡️ R

👇
General Architecture:

1️⃣ Once you submit a Spark Application, a SparkContext object is created in the Driver Program. This object is responsible for communicating with the Cluster Manager.

2️⃣ SparkContext negotiates with the Cluster Manager for the resources required to run the Spark Application. The Cluster Manager allocates the resources inside the respective Cluster and creates the requested number of Spark Executors.

3️⃣ After starting, Spark Executors connect to SparkContext to announce that they have joined the Cluster. Executors then send regular heartbeats to notify the Driver Program that they are healthy and don't need rescheduling.

4️⃣ Spark Executors are responsible for executing the tasks of the Computation DAG (Directed Acyclic Graph). This could include reading or writing data, or performing a certain operation on a partition of an RDD.
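
To make the four steps above concrete, here is a minimal pure-Python sketch of the handshake. It is a toy model, not Spark's implementation: the `Driver`, `ClusterManager` and `Executor` classes, the `heartbeat_timeout` value and the explicit `now` timestamps are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Toy stand-in for a Spark Executor (hypothetical, not Spark's class)."""
    executor_id: int

    def run_task(self, partition):
        # Step 4: an executor runs one task on one partition of the data.
        return sum(partition)


@dataclass
class ClusterManager:
    """Toy cluster manager that hands out executors on request."""
    capacity: int

    def allocate(self, requested):
        # Step 2: grant the requested executors, capped by available capacity.
        granted = min(requested, self.capacity)
        return [Executor(executor_id=i) for i in range(granted)]


@dataclass
class Driver:
    """Toy driver holding the bookkeeping a SparkContext would do."""
    heartbeat_timeout: float = 10.0
    last_heartbeat: dict = field(default_factory=dict)

    def register(self, executor, now):
        # Step 3: an executor announces itself after starting.
        self.last_heartbeat[executor.executor_id] = now

    def heartbeat(self, executor, now):
        # Step 3: regular "I am healthy" signal from an executor.
        self.last_heartbeat[executor.executor_id] = now

    def lost_executors(self, now):
        # Executors whose last heartbeat is older than the timeout.
        return sorted(
            eid for eid, seen in self.last_heartbeat.items()
            if now - seen > self.heartbeat_timeout
        )


driver = Driver(heartbeat_timeout=10.0)  # step 1: the driver program starts
executors = ClusterManager(capacity=3).allocate(requested=4)
for ex in executors:
    driver.register(ex, now=0.0)

# Executors 0 and 1 keep sending heartbeats; executor 2 goes silent.
driver.heartbeat(executors[0], now=8.0)
driver.heartbeat(executors[1], now=8.0)
lost = driver.lost_executors(now=12.0)

# Each executor runs the same function on its own partition.
partitions = [[1, 2], [3, 4], [5, 6]]
results = [ex.run_task(p) for ex, p in zip(executors, partitions)]
print(lost, results)
```

Running it prints `[2] [3, 7, 11]`: the silent executor is flagged as a candidate for rescheduling, while each executor returns the result for its own partition.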

👇
Supported Cluster Managers:

➡️ Standalone - simple cluster manager shipped together with Spark.
➡️ Hadoop YARN - resource manager of the Hadoop ecosystem.
➡️ Apache Mesos - general cluster manager (❗️ deprecated).
➡️ Kubernetes - popular open-source container orchestrator.
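
In practice the Cluster Manager is usually chosen through the master URL passed to `spark-submit`. The sketch below only builds the argument list in Python and does not launch anything. The host names, ports, `my_app.py` and the helper `spark_submit_command` are placeholders invented for illustration, and how `spark.executor.instances` is honored can differ between cluster managers.

```python
# Example master URL shapes per cluster manager.
# Host names and ports are placeholders, not real endpoints.
MASTER_URLS = {
    "standalone": "spark://master-host:7077",
    "yarn": "yarn",
    "mesos": "mesos://mesos-host:5050",  # deprecated cluster manager
    "kubernetes": "k8s://https://k8s-apiserver:6443",
}


def spark_submit_command(cluster_manager, app_path, executors=2, deploy_mode="client"):
    """Build a spark-submit argument list (hypothetical helper)."""
    return [
        "spark-submit",
        "--master", MASTER_URLS[cluster_manager],
        "--deploy-mode", deploy_mode,
        # Requested number of executors; handling varies by cluster manager.
        "--conf", f"spark.executor.instances={executors}",
        app_path,
    ]


cmd = spark_submit_command("yarn", "my_app.py", executors=4)
print(" ".join(cmd))
```

Switching the cluster manager only changes the `--master` value; the application code itself stays the same.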

Spark Job Internals:

👉 Spark Driver is responsible for constructing an optimized physical execution plan for a given application submitted for execution.
👉 This plan materializes into a Job, which is a DAG of Stages.
👉 Some of the Stages can be executed in parallel if they have no sequential dependencies.
👉 Each Stage is composed of Tasks.
👉 All Tasks of a single Stage perform the same type of work, each on a different partition of the data. A Task is the smallest piece of work that can be executed in parallel, and it is performed by a Spark Executor.
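
The Job, Stage and Task hierarchy above can be sketched in pure Python. This is a toy model under stated assumptions, not Spark's scheduler: the stage names, the partition counts and the helpers `plan_waves` and `tasks_for` are invented for illustration, and it relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical job: each stage maps to the set of stages it depends on.
stage_dependencies = {
    "read_orders": set(),
    "read_users": set(),
    "join": {"read_orders", "read_users"},
    "aggregate": {"join"},
}

# Hypothetical partition count per stage; one task per partition.
partitions_per_stage = {"read_orders": 4, "read_users": 2, "join": 3, "aggregate": 1}


def plan_waves(dependencies):
    """Group stages into waves; stages in the same wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose parents are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def tasks_for(stage):
    """Every task in a stage does the same work on a different partition."""
    return [f"{stage}-task-{p}" for p in range(partitions_per_stage[stage])]


waves = plan_waves(stage_dependencies)
for wave in waves:
    print(wave, [len(tasks_for(stage)) for stage in wave])
```

The two read stages have no dependency on each other, so they land in the same wave and could run in parallel, while `join` has to wait for both of them.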

👇
Join a growing community of 6000+ Data Professionals by subscribing to my Newsletter: newsletter.swirlai.com/p/sai-03-machi…
