Aurimas Griciūnas

Feb 28 • 12 tweets • 4 min read

What is the difference between Splittable and Non-Splittable Files?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience

You are very likely to run into a 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗦𝘆𝘀𝘁𝗲𝗺 𝗼𝗿 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 in your career. It could be 𝗦𝗽𝗮𝗿𝗸, 𝗛𝗶𝘃𝗲, 𝗣𝗿𝗲𝘀𝘁𝗼 or any other.

👇

It is also very likely that these Frameworks will be reading data from distributed storage, such as 𝗛𝗗𝗙𝗦 or 𝗦𝟯.

👇

These Frameworks utilize multiple 𝗖𝗣𝗨 𝗖𝗼𝗿𝗲𝘀 𝗳𝗼𝗿 𝗟𝗼𝗮𝗱𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 and performing 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 in parallel.

👇

How files are stored in your 𝗦𝘁𝗼𝗿𝗮𝗴𝗲 𝗦𝘆𝘀𝘁𝗲𝗺 𝗶𝘀 𝗞𝗲𝘆 for utilizing distributed 𝗥𝗲𝗮𝗱 𝗮𝗻𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁𝗹𝘆.

𝗦𝗼𝗺𝗲 𝗱𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻𝘀:

👇

➡️ 𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 𝗙𝗶𝗹𝗲𝘀 are files that several processes can read in parts at the same time, each process taking a different portion of the file.
➡️ In distributed file systems and block storage, files are stored in chunks called blocks.
➡️ Block sizes vary between storage systems.
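
To make the block idea concrete, here is a minimal sketch of how a file maps onto fixed-size blocks. The function name `block_layout` is hypothetical, and the 128 MB block size is an assumption (it is a common HDFS default; your storage system may use a different value).

```python
def block_layout(file_size: int, block_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) for every block a file of file_size bytes occupies."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


MB = 1024 * 1024

# Assumption: 128 MB blocks. A 300 MB file then occupies three blocks.
layout = block_layout(file_size=300 * MB, block_size=128 * MB)
for offset, length in layout:
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```

The last block is simply shorter than the others; it holds whatever is left of the file.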

👇

𝗧𝗵𝗶𝗻𝗴𝘀 𝘁𝗼 𝗸𝗻𝗼𝘄:

➡️ If your file is 𝗡𝗼𝗻-𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 and is bigger than a block in storage - it will still be stored across several blocks, but it can only be read by a 𝗦𝗶𝗻𝗴𝗹𝗲 𝗖𝗣𝗨 𝗖𝗼𝗿𝗲. The other cores have nothing to read, which can cause 𝗜𝗱𝗹𝗲 𝗖𝗣𝗨 time.

👇

➡️ If your file is 𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 - multiple cores can read it at the same time (typically one core per block).
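
A rough sketch of the difference, under simplifying assumptions: one read task per block for a splittable file, exactly one task for a non-splittable file, and one task per core. The helper names `parallel_read_tasks` and `idle_cores` are made up for illustration, and real frameworks apply their own split-size settings.

```python
import math


def parallel_read_tasks(file_size: int, block_size: int, splittable: bool) -> int:
    """Number of read tasks that can run at the same time for one file."""
    if file_size == 0:
        return 0
    if not splittable:
        return 1  # the whole file must be read by a single task
    return math.ceil(file_size / block_size)


def idle_cores(cores: int, tasks: int) -> int:
    """Cores left without a read task, assuming one task per core."""
    return max(cores - tasks, 0)


MB = 1024 * 1024
file_size, block_size, cores = 1024 * MB, 128 * MB, 8

for splittable in (True, False):
    tasks = parallel_read_tasks(file_size, block_size, splittable)
    print(f"splittable={splittable}: tasks={tasks}, idle cores={idle_cores(cores, tasks)}")
```

With these numbers the splittable file keeps all 8 cores busy, while the non-splittable one leaves 7 of them idle during the read.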

👇

𝗦𝗼𝗺𝗲 𝗴𝘂𝗶𝗱𝗮𝗻𝗰𝗲:

➡️ If possible - prefer 𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 𝗙𝗶𝗹𝗲 types.
➡️ If you are forced to use 𝗡𝗼𝗻-𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 files - manually partition them into several smaller files, each small enough to fit into a single file system block, so that every file can be read by a different CPU core.
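
One possible way to do the manual partitioning, sketched with gzip as the non-splittable example. The function `write_gzip_parts`, the `part-00000.jsonl.gz` naming and the size limit are all assumptions made for illustration. The limit is applied to uncompressed bytes, which is only a rough proxy for the size on disk.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterable


def write_gzip_parts(lines: Iterable[str], out_dir: Path, max_part_bytes: int) -> list[Path]:
    """Write lines into several gzip files, starting a new part before the limit is exceeded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    handle = None
    written = 0
    for line in lines:
        data = (line.rstrip("\n") + "\n").encode("utf-8")
        # Open a new part for the first line, or when this line would not fit.
        if handle is None or written + len(data) > max_part_bytes:
            if handle is not None:
                handle.close()
            path = out_dir / f"part-{len(parts):05d}.jsonl.gz"
            handle = gzip.open(path, "wb")
            parts.append(path)
            written = 0
        handle.write(data)
        written += len(data)
    if handle is not None:
        handle.close()
    return parts


with tempfile.TemporaryDirectory() as tmp:
    records = (f'{{"id": {i}}}' for i in range(1000))
    demo_parts = write_gzip_parts(records, Path(tmp), max_part_bytes=4096)
    print(f"wrote {len(demo_parts)} parts")
```

Each part is still non-splittable on its own, but a framework can hand different parts to different cores.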

👇

𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 𝗳𝗶𝗹𝗲 𝗳𝗼𝗿𝗺𝗮𝘁𝘀:

👉 𝗔𝘃𝗿𝗼.
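👉 𝗖𝗦𝗩 (uncompressed, or compressed with a splittable codec such as bzip2; a gzipped CSV is not splittable).
👉 𝗢𝗥𝗖.
👉 𝗻𝗱𝗝𝗦𝗢𝗡 (same compression caveat as CSV).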
👉 𝗣𝗮𝗿𝗾𝘂𝗲𝘁.
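
For the newline-delimited formats in this list (CSV, ndJSON), splitting works because a reader can jump to any byte offset and find the next record boundary. Below is a minimal sketch of that idea, not the implementation of any particular framework. `read_split` is a hypothetical helper and the convention used (a line belongs to the split that contains its first byte) is an assumption. Avro, ORC and Parquet rely on their own internal markers and metadata instead of newlines.

```python
import io
import json


def read_split(stream, start: int, end: int):
    """Yield complete lines whose first byte lies in [start, end)."""
    if start > 0:
        # Step back one byte and finish the line owned by the previous split.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of file
        yield line


data = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n{"id": 4}\n'
split_size = 15  # deliberately not aligned with line boundaries

ids = []
for start in range(0, len(data), split_size):
    # In a real system each split would be read by a different worker in parallel.
    for line in read_split(io.BytesIO(data), start, start + split_size):
        ids.append(json.loads(line)["id"])

print(ids)  # every record is read exactly once: [1, 2, 3, 4]
```

Even though the splits cut through the middle of records, no record is lost or read twice.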

👇

𝗡𝗼𝗻-𝗦𝗽𝗹𝗶𝘁𝘁𝗮𝗯𝗹𝗲 𝗳𝗶𝗹𝗲 𝗳𝗼𝗿𝗺𝗮𝘁𝘀:

👉 𝗣𝗿𝗼𝘁𝗼𝗰𝗼𝗹 𝗕𝘂𝗳𝗳𝗲𝗿𝘀.
👉 𝗝𝗦𝗢𝗡 (a single document that spans multiple lines).
👉 𝗫𝗠𝗟.
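
A small illustration of why a single JSON document behaves differently from ndJSON. This is only a sketch using Python's standard `json` module, and the helper `parses` is made up for the example. A reader that starts in the middle of one big JSON document gets a fragment it cannot parse, while every ndJSON line is a complete record on its own.

```python
import json

records = [{"id": i} for i in range(4)]

# One JSON document holding all records vs. one JSON record per line.
single_document = json.dumps(records, indent=2)
ndjson = "\n".join(json.dumps(record) for record in records)


def parses(text: str) -> bool:
    """Return True if text is a complete, valid JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A reader starting halfway through the single document sees only a fragment.
second_half = single_document[len(single_document) // 2:]
print("second half of JSON document parses:", parses(second_half))

# Each ndJSON line can be parsed without seeing any other line.
print("every ndJSON line parses:", all(parses(line) for line in ndjson.splitlines()))
```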

