Aurimas Griciūnas
Feb 28 • 12 tweets • 4 min read
What is the difference between Splittable and Non-Splittable Files?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
You are very likely to run into a Distributed Compute System or Framework in your career. It could be Spark, Hive, Presto or any other.

👇
Also, it is very likely that these Frameworks will be reading data from distributed storage. It could be HDFS, S3 etc.

👇
These Frameworks utilize multiple CPU Cores for Loading Data and performing Distributed Compute in parallel.

👇
How files are stored in your Storage System is Key for utilizing distributed Read and Compute Efficiently.

Some definitions:

👇
➡️ Splittable Files are Files that can be partially read by several processes at the same time.
➡️ In distributed file or block storage, files are stored in chunks called blocks.
➡️ Block sizes vary between storage systems.

👇
Things to know:

➡️ If your file is Non-Splittable and is bigger than a block in storage - it will still be split between blocks, but it will only be read by a Single CPU Core, which might cause Idle CPU time on the other cores.

👇
➡️ If your file is Splittable - multiple cores can read it at the same time (one core per block).
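
To make the two cases above concrete, here is a minimal Python sketch of the arithmetic. The function name parallel_read_tasks and the 128 MiB block size are assumptions chosen for illustration; real frameworks decide their read tasks with their own, more detailed logic.

```python
import math


def parallel_read_tasks(file_size_bytes: int, block_size_bytes: int, splittable: bool) -> int:
    """Upper bound on how many cores can read one file at the same time.

    Hypothetical helper for illustration only, not an API of any framework.
    """
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        # A Non-Splittable file has to be read front to back by a single reader,
        # no matter how many blocks it occupies in storage.
        return 1
    # Splittable file: one read task per block.
    return math.ceil(file_size_bytes / block_size_bytes)


MiB = 1024 * 1024
block_size = 128 * MiB   # assumed block size
file_size = 1024 * MiB   # a 1 GiB file

print(parallel_read_tasks(file_size, block_size, splittable=True))   # 8
print(parallel_read_tasks(file_size, block_size, splittable=False))  # 1
```

With 8 cores available, the Non-Splittable version of the same file would leave 7 of them idle during the read.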

👇
Some guidance:

➡️ If possible - prefer Splittable File types.
➡️ If you are forced to use Non-Splittable files - manually partition them into files small enough to fit into a single FS Block, so that more CPU Cores can be utilized.
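
One way the manual partitioning could look is sketched below in Python. The helper partition_records and the tiny sizes in the usage lines are hypothetical; in practice the limit would be the block size of your storage, and each yielded part would be written out as its own file.

```python
from typing import Iterable, Iterator, List


def partition_records(records: Iterable[bytes], max_part_bytes: int) -> Iterator[List[bytes]]:
    """Group serialized records into parts that each stay within max_part_bytes.

    Hypothetical helper: a single record larger than the limit still gets a
    part of its own, because a record cannot be cut in half.
    """
    part: List[bytes] = []
    part_size = 0
    for record in records:
        # Close the current part when this record would push it over the limit.
        if part and part_size + len(record) > max_part_bytes:
            yield part
            part, part_size = [], 0
        part.append(record)
        part_size += len(record)
    if part:
        yield part


# Ten 40-byte records with a 100-byte limit: two records fit per part.
records = [b"x" * 40 for _ in range(10)]
parts = list(partition_records(records, max_part_bytes=100))
print(len(parts))  # 5
```

Each of the resulting files can then be picked up by a different core, even though none of them is Splittable on its own.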

👇
Splittable file formats:

👉 Avro.
👉 CSV.
👉 ORC.
👉 ndJSON.
👉 Parquet.

👇
Non-Splittable file formats:

👉 Protocol Buffers.
👉 JSON.
👉 XML.
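
Why ndJSON lands in the first list and plain JSON in the second can be sketched in a few lines of Python. In ndJSON every record ends with a newline, so a reader can jump to an arbitrary byte offset and resynchronize at the next newline. A single multi-line JSON document has no such marker, so it has to be parsed from the first byte. The function read_ndjson_split and the byte ranges below are hypothetical and only illustrate the idea, not how any particular framework implements it.

```python
import io
import json
from typing import BinaryIO, Iterator


def read_ndjson_split(f: BinaryIO, start: int, end: int) -> Iterator[dict]:
    """Yield the records whose first byte lies in the range [start, end)."""
    if start == 0:
        f.seek(0)
    else:
        # Step back one byte and discard everything through the next newline.
        # If `start` sits exactly on a record boundary this only consumes the
        # preceding "\n"; otherwise it skips the record that began in the
        # previous split, which the previous reader finishes.
        f.seek(start - 1)
        f.readline()
    while f.tell() < end:
        line = f.readline()  # may read past `end` to finish the last record
        if not line:
            break
        if line.strip():
            yield json.loads(line)


# Four 10-byte records, read as three independent byte ranges.
data = b'{"id": 0}\n{"id": 1}\n{"id": 2}\n{"id": 3}\n'
splits = [(0, 15), (15, 30), (30, 40)]
ids_per_split = [
    [rec["id"] for rec in read_ndjson_split(io.BytesIO(data), start, end)]
    for start, end in splits
]
print(ids_per_split)  # [[0, 1], [2], [3]]
```

Every record is read exactly once even though the split boundaries fall in the middle of records, which is what lets several cores work on the same file at the same time.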

👇
👋 I am Aurimas.

I will help you Level Up in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space.

Follow me and hit 🔔

Join a growing community of 6000+ Data Professionals by subscribing to my Newsletter: newsletter.swirlai.com/p/sai-19-the-d…
