Aurimas Griciลซnas Profile picture
Feb 1 โ€ข 14 tweets โ€ข 5 min read
๐—ก๐—ผ ๐—˜๐˜…๐—ฐ๐˜‚๐˜€๐—ฒ๐˜€ ๐——๐—ฎ๐˜๐—ฎ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐—ถ๐—ป๐—ด ๐—ฃ๐—ผ๐—ฟ๐˜๐—ณ๐—ผ๐—น๐—ถ๐—ผ ๐—ง๐—ฒ๐—บ๐—ฝ๐—น๐—ฎ๐˜๐—ฒ - next week I will enrich it with the missing Machine Learning and MLOps parts!

๐Ÿงต

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Today - letโ€™s review it once more. It is super helpful as these kind of Data Architectures are what you will find in real life situations.

๐—ฅ๐—ฒ๐—ฐ๐—ฎ๐—ฝ:

๐Ÿ‘‡
๐Ÿญ. Data Producers - Python Applications that extract data from chosen Data Sources and push it to Collector via REST or gRPC API calls.

๐Ÿ‘‡
2. Collector - REST or gRPC server written in Python that takes a payload, validates top level field existence and correctness, adds additional metadata and pushes it into either Raw Events Topic if the validation passes or a Dead Letter Queue if top level fields are invalid.

๐Ÿ‘‡
3. Enricher/Validator - Python or Spark Application that validates event schema in Raw Events Topic, performs data enrichment and pushes results into either Enriched Events Topic if validation passes and enrichment succeeds or a Dead Letter Queue if any of previous failed.

๐Ÿ‘‡
๐Ÿฐ. Enrichment API - API of any flavor implemented with Python that can be called for enrichment purposes by Enricher/Validator. This could be a Machine Learning Model deployed as an API as well.

๐Ÿ‘‡
๐Ÿฑ. Real Time Loader - Python or Spark Application that reads data from Enriched Events and Enriched Events Dead Letter Topics and writes them in real time to ElasticSearch Indexes for Analysis and alerting.

๐Ÿ‘‡
๐Ÿฒ. Batch Loader - Python or Spark Application that reads data from Enriched Events Topic, batches it in memory and writes to MinIO Object Storage.

๐Ÿ‘‡
๐Ÿณ. Scripts Scheduled via Airflow that read data from Enriched Events MinIO bucket, validates data quality, performs deduplication and any additional Enrichments. Here you also construct your Data Model to be later used for reporting purposes.

๐Ÿ‘‡
๐—ง๐Ÿญ. A single Kafka instance that will hold all of the Topics for the Project.
๐—ง๐Ÿฎ. A single MinIO instance that will hold all of the Buckets for the Project.
๐—ง๐Ÿฏ. Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO.

๐Ÿ‘‡
๐—ง๐Ÿฐ. Presto/Trino cluster that you mount on top of Curated Data in MinIO so that you can query it using Superset.
๐—ง๐Ÿฑ: ElasticSearch instance to hold Real Time Data.

๐Ÿ‘‡
๐—ง๐Ÿฒ. Superset Instance that you mount on top of Trino Querying Engine for Batch Analytics and Elasticsearch for Real Time Analytics.

๐Ÿ‘‡
๐Ÿ‘‹ I am Aurimas.

I will help you Level Up in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space.

๐—™๐—ผ๐—น๐—น๐—ผ๐˜„ ๐—บ๐—ฒ and hit ๐Ÿ””

Join a growing community of 5200+ Data Enthusiasts by subscribing to my ๐—ก๐—ฒ๐˜„๐˜€๐—น๐—ฒ๐˜๐˜๐—ฒ๐—ฟ: swirlai.substack.com/p/sai-15-whatsโ€ฆ
Also join the SwirlAI Data Talent Collective that will help you find your next career step: swirlai.pallet.com/talent/welcomeโ€ฆ

โ€ข โ€ข โ€ข

Missing some Tweet in this thread? You can try to force a refresh
ใ€€

Keep Current with Aurimas Griciลซnas

Aurimas Griciลซnas Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @Aurimas_Gr

Jan 31
What are ๐—Ÿ๐—ฎ๐—บ๐—ฏ๐—ฑ๐—ฎ ๐—ฎ๐—ป๐—ฑ ๐—ž๐—ฎ๐—ฝ๐—ฝ๐—ฎ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ๐˜€?

๐Ÿงต

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Lambda and Kappa are both Data architectures proposed to solve movement of large amounts of data for reliable Online access.

๐Ÿ‘‡
The most popular architecture has been and continues to be Lambda. However, with Stream Processing becoming more accessible to organizations of every size you will be hearing a lot more of Kappa in the near future. Letโ€™s see how they are different.

๐Ÿ‘‡
Read 15 tweets
Jan 30
Letโ€™s remind ourselves of how a ๐—ฅ๐—ฒ๐—พ๐˜‚๐—ฒ๐˜€๐˜-๐—ฅ๐—ฒ๐˜€๐—ฝ๐—ผ๐—ป๐˜€๐—ฒ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐——๐—ฒ๐—ฝ๐—น๐—ผ๐˜†๐—บ๐—ฒ๐—ป๐˜ looks like - ๐—ง๐—ต๐—ฒ ๐— ๐—Ÿ๐—ข๐—ฝ๐˜€ ๐—ช๐—ฎ๐˜†.

๐Ÿงต

#MLOps #MachineLearning #DataScience #Data Image
You will find this type of model deployment to be the most popular when it comes to Online Machine Learning Systems.

Let's zoom in:

๐Ÿญ: Version Control: Machine Learning Training Pipeline is defined in code, once merged to the main branch it is built and triggered.

๐Ÿ‘‡
๐Ÿฎ: Feature Preprocessing: Features are retrieved from the Feature Store, validated and passed to the next stage. Any feature related metadata that is tightly coupled to the Model being trained is saved to the Experiment Tracking System.

๐Ÿ‘‡
Read 14 tweets
Dec 23, 2022
If I could only choose 5 books to read in 2023 as an aspiring Data Engineer these would be them in a specific order:

Read on in the Thread ๐Ÿ‘‡

--------

Follow me and hit ๐Ÿ”” to ๐—Ÿ๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—จ๐—ฝ in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space!
1๏ธโƒฃ โ€๐—™๐˜‚๐—ป๐—ฑ๐—ฎ๐—บ๐—ฒ๐—ป๐˜๐—ฎ๐—น๐˜€ ๐—ผ๐—ณ ๐——๐—ฎ๐˜๐—ฎ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐—ถ๐—ป๐—ดโ€ - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.

๐Ÿ‘‡
2๏ธโƒฃ โ€๐—”๐—ฐ๐—ฐ๐—ฒ๐—น๐—ฒ๐—ฟ๐—ฎ๐˜๐—ฒโ€ - Data Engineers should follow the same practices that Software Engineers do and more. After reading this book you will understand DevOps practices in and out.

๐Ÿ‘‡
Read 9 tweets
Dec 22, 2022
What is a ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ and why is it such an important element in ๐— ๐—Ÿ๐—ข๐—ฝ๐˜€ ๐—ฆ๐˜๐—ฎ๐—ฐ๐—ธ?

Find out in the Thread ๐Ÿ‘‡

--------

๐—™๐—ผ๐—น๐—น๐—ผ๐˜„ ๐—บ๐—ฒ and hit ๐Ÿ”” to ๐—Ÿ๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—จ๐—ฝ in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space! Image
Feature Store System sits between Data Engineering and Machine Learning Pipelines and it solves the following issues:

โžก๏ธ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5)

๐Ÿ‘‡
โžก๏ธ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog and then serve Feature Sets for training and inference purposes trough unified interface (4๏ธ,3).

๐Ÿ‘‡
Read 15 tweets
Dec 21, 2022
Do you know what CDC(Change Data Capture) is and that there are multiple ways to implement it?

Find out in the Thread ๐Ÿ‘‡

--------

๐—™๐—ผ๐—น๐—น๐—ผ๐˜„ ๐—บ๐—ฒ and hit ๐Ÿ”” to ๐—Ÿ๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—จ๐—ฝ in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space! Image
๐—–๐—ต๐—ฎ๐—ป๐—ด๐—ฒ ๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ฎ๐—ฝ๐˜๐˜‚๐—ฟ๐—ฒ is a software process used to replicate actions performed against Operational Databases for use in downstream applications.

๐—ง๐—ต๐—ฒ๐—ฟ๐—ฒ ๐—ฎ๐—ฟ๐—ฒ ๐˜€๐—ฒ๐˜ƒ๐—ฒ๐—ฟ๐—ฎ๐—น ๐˜‚๐˜€๐—ฒ ๐—ฐ๐—ฎ๐˜€๐—ฒ๐˜€ ๐—ณ๐—ผ๐—ฟ CDC. ๐—ง๐˜„๐—ผ ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—บ๐—ฎ๐—ถ๐—ป ๐—ผ๐—ป๐—ฒ๐˜€:

๐Ÿ‘‡
โžก๏ธ ๐——๐—ฎ๐˜๐—ฎ๐—ฏ๐—ฎ๐˜€๐—ฒ ๐—ฅ๐—ฒ๐—ฝ๐—น๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป (refer to 3๏ธโƒฃ in the Diagram).

๐Ÿ‘‰ CDC can be used for moving transactions performed against Source Database to a Target DB. If each transaction is replicated - it is possible to retain all ACID guarantees when performing replication.

๐Ÿ‘‡
Read 15 tweets
Dec 21, 2022
What does good Model Tracking System look like?

Find out in the Thread ๐Ÿ‘‡

--------

๐—™๐—ผ๐—น๐—น๐—ผ๐˜„ ๐—บ๐—ฒ and hit ๐Ÿ”” to ๐—Ÿ๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—จ๐—ฝ in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space! Image
It should be composed of two integrated parts: Experiment Tracking System and a Model Registry.

From where you track ML Pipeline metadata will depend on MLOps maturity in your company.

If you are at the beginning of the ML journey you might be:

๐Ÿ‘‡
1๏ธโƒฃ Training and Serving your Models from experimentation environment - you run ML Pipelines inside of your Notebook and do that manually at each retraining.

If you are beyond Notebooks you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator triggers.

๐Ÿ‘‡
Read 14 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us on Twitter!

:(