Aurimas Griciūnas
Mar 17 • 8 tweets • 4 min read
What is a correct Data Engineering Learning Path?

My thoughts in the 🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
I believe that the following is a correct order in which to start your Data Engineering Path:

👇
➡️ Understand Basic Processes:

👉 Data Extraction
👉 Data Validation
👉 Data Contracts
👉 Loading Data into a DWH / Data Lake
👉 Transformations in a DWH / Data Lake
👉 Scheduling
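
To see how the Basic Processes above chain together, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for a DWH; the record fields, the `CONTRACT` dictionary and the table names are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical data contract: required field names and their expected types.
CONTRACT = {"order_id": int, "customer": str, "amount": float}


def extract():
    """Extraction: stand-in for pulling records from an API, file or database."""
    return [
        {"order_id": 1, "customer": "alice", "amount": 20.0},
        {"order_id": 2, "customer": "bob", "amount": 35.5},
        {"order_id": 3, "customer": "alice", "amount": "n/a"},  # breaks the contract
    ]


def validate(records, contract):
    """Validation: split records into those that honour the contract and the rest."""
    valid, rejected = [], []
    for record in records:
        ok = set(record) == set(contract) and all(
            isinstance(record[field], expected) for field, expected in contract.items()
        )
        (valid if ok else rejected).append(record)
    return valid, rejected


def load(conn, records):
    """Loading: write validated records into a raw table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", records
    )


def transform(conn):
    """Transformation: build an aggregate table inside the warehouse with SQL."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT customer, SUM(amount) AS total FROM raw_orders GROUP BY customer"
    )


def run_pipeline(conn):
    """One pipeline run; a scheduler would trigger this on a fixed interval."""
    valid, rejected = validate(extract(), CONTRACT)
    load(conn, valid)
    transform(conn)
    return len(valid), len(rejected)


conn = sqlite3.connect(":memory:")
loaded, rejected = run_pipeline(conn)
totals = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(loaded, rejected, totals)
```

In a real setup each function would typically be its own task, so that a failed validation stops the load instead of polluting the warehouse.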

👇
➡️ Learn the most widely used tooling by creating a personal project that leverages the technology:

👉 Python
👉 SQL

👇
👉 Airflow - yes, Airflow. Many say that focusing only on Airflow as a scheduler is a narrow-minded approach. Well, you will find Airflow in 99% of job ads, so start with it and forget what people say.
👉 Spark
👉 dbt
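
Before picking up Airflow, it can help to see the core idea a workflow scheduler builds on: tasks form a directed acyclic graph and each task runs only after its dependencies. The sketch below uses Python's standard-library `graphlib`, not Airflow's API; the task names and the `run_task` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "load": {"validate"},
    "transform": {"load"},
}

executed = []


def run_task(name):
    """Stand-in for real work; here we only record the execution order."""
    executed.append(name)


# static_order() yields each task only after all of its dependencies.
for task in TopologicalSorter(dependencies).static_order():
    run_task(task)

print(executed)
```

A real scheduler adds what this sketch leaves out, such as time-based triggers, retries and monitoring.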

👇
➡️ Learn Fundamentals Deeply:

👉 Data Modeling
👉 Distributed Compute
👉 Stakeholder Management
👉 System Design
👉 …

👇
➡️ Continue Learning/Specializing:

👉 Stream Processing
👉 Feature Stores
👉 Data Governance
👉 DataOps
👉 Different tooling to implement the same Basic Processes
👉 ...

👇
👋 I am Aurimas.

I will help you Level Up in #MLOps, #MachineLearning, #DataEngineering, #DataScience and the overall #Data space.

Follow me and hit 🔔

Join a growing community of 6500+ Data Enthusiasts by subscribing to my Newsletter: newsletter.swirlai.com/p/sai-21-what-…
