What are the basics of Writing Data to a Kafka Topic?
🧵
#Data #DataEngineering #MLOps #MachineLearning #DataScience
Kafka is an extremely important 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺 to understand: it was one of the first of its kind, and most newer products build on its ideas.
𝗦𝗼𝗺𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗱𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻𝘀:
👇
Mar 16, 2023 • 8 tweets • 4 min read
So what is the difference between Row Based and Column Based file formats?
➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently since you can write the entire row at once.
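A minimal sketch of the two layouts, assuming a hypothetical fixed-width table with an int32 `user_id` and a float64 `score` (these names and the binary encoding are illustrative, not any real file format). It shows why appending a row is a single contiguous write in the row-based layout, while the column-based layout groups the values of each column together.

```python
import struct

# Hypothetical three-row table: (user_id: int32, score: float64)
rows = [(1, 0.5), (2, 0.75), (3, 1.0)]

# One fixed-width record: little-endian int32 followed by float64 (12 bytes).
ROW = struct.Struct("<id")


def append_row(buffer: bytearray, row: tuple) -> None:
    """Row-based layout: the whole row is appended in one contiguous write."""
    buffer.extend(ROW.pack(*row))


def to_columns(table: list) -> tuple:
    """Column-based layout: all values of one column are stored together."""
    ids = struct.pack(f"<{len(table)}i", *(r[0] for r in table))
    scores = struct.pack(f"<{len(table)}d", *(r[1] for r in table))
    return ids, scores


row_file = bytearray()
for row in rows:
    append_row(row_file, row)

id_column, score_column = to_columns(rows)

# Reading the second row back needs a single seek in the row-based layout.
second_row = ROW.unpack_from(row_file, ROW.size * 1)
```

In the column-based layout, reading every `score` touches only `score_column`, whereas the row-based layout has to step over each `user_id` along the way.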
👇
Mar 15, 2023 • 12 tweets • 4 min read
What are the main use cases for Apache Kafka or any other Distributed Messaging System?
🧵
#Data #DataEngineering #MLOps #MachineLearning #DataScience
We have covered lots of concepts around Kafka already. But what are the most common use cases for the system that you are very likely to run into as a Data Engineer?
𝟭: Everything starts in version control: Machine Learning Training Pipeline is defined in code, once merged to the main branch it is built and triggered.
👇
Feb 27, 2023 • 13 tweets • 5 min read
How do we 𝗗𝗲𝗰𝗼𝗺𝗽𝗼𝘀𝗲 𝗥𝗲𝗮𝗹 𝗧𝗶𝗺𝗲 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗦𝗲𝗿𝘃𝗶𝗰𝗲 𝗟𝗮𝘁𝗲𝗻𝗰𝘆 and why should you care to understand the pieces as an ML Engineer?
Find out in the 🧵
#Data #DataEngineering #MLOps #MachineLearning #DataScience
Usually, what the users of your Machine Learning Service care about is the total endpoint latency - the time between when a request is sent (1.) to the Service and when the response is received (6.).
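A rough sketch of how total latency can be broken into per-stage timings, assuming two hypothetical stages (`fetch_features` and `predict`); the stage names and their toy logic are placeholders and not the steps numbered in the thread's diagram.

```python
import time


def timed(stage_timings: dict, name: str, fn, *args):
    """Run fn(*args) and record its duration in seconds under `name`."""
    start = time.perf_counter()
    result = fn(*args)
    stage_timings[name] = time.perf_counter() - start
    return result


# Hypothetical stages of a request; a real service will have different ones.
def fetch_features(request: dict) -> dict:
    return {"user_id": request["user_id"], "clicks": 3}


def predict(features: dict) -> float:
    return 0.1 * features["clicks"]


def handle_request(request: dict) -> tuple:
    """Serve one request and return the score plus a latency breakdown."""
    timings = {}
    start = time.perf_counter()
    features = timed(timings, "feature_lookup", fetch_features, request)
    score = timed(timings, "inference", predict, features)
    # Total covers the stages above plus any overhead between them.
    timings["total"] = time.perf_counter() - start
    return score, timings


score, timings = handle_request({"user_id": 7})
```

Comparing `timings["total"]` against the sum of the individual stages shows how much time is spent outside the measured pieces.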
👇
Feb 23, 2023 • 15 tweets • 3 min read
Do you know how 𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸 𝗶𝘀 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝗲𝗱?
Find out in the 🧵
#Data #DataEngineering #MLOps #MachineLearning #DataScience
𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸 is an extremely popular distributed processing framework utilizing in-memory processing to speed up task execution. Most of its libraries are contained in the Spark Core layer.
👇
Feb 23, 2023 • 14 tweets • 5 min read
A refresher on the role of 𝗗𝗮𝘁𝗮 𝗖𝗼𝗻𝘁𝗿𝗮𝗰𝘁𝘀 in the Data Pipeline.
Read on in the 🧵
#Data #DataEngineering #MLOps #MachineLearning #DataScience
In its simplest form, a Data Contract is an agreement between Data Producers and Data Consumers on what the Data being produced should look like, what SLAs it should meet, and what its semantics are.
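One way to picture such an agreement in code, as a sketch only: the `DataContract` class, its fields, and the `violations` helper below are hypothetical names, and the contract is reduced to a field/type schema plus a single freshness SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: expected fields and types plus a freshness SLA."""
    schema: dict            # field name -> expected Python type
    max_delay_seconds: int  # SLA: how late a record is allowed to arrive


def violations(contract: DataContract, record: dict, delay_seconds: int) -> list:
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    if delay_seconds > contract.max_delay_seconds:
        problems.append("freshness SLA exceeded")
    return problems


# Illustrative contract for a made-up "orders" dataset.
contract = DataContract(schema={"order_id": int, "amount": float},
                        max_delay_seconds=60)

good = violations(contract, {"order_id": 1, "amount": 9.5}, delay_seconds=10)
bad = violations(contract, {"order_id": "1"}, delay_seconds=120)
```

A producer could run a check like this before publishing, and a consumer could run the same check on arrival, so both sides enforce one shared definition.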
👇
Feb 22, 2023 • 12 tweets • 5 min read
What does a 𝗥𝗲𝗮𝗹 𝗧𝗶𝗺𝗲 𝗦𝗲𝗮𝗿𝗰𝗵 𝗼𝗿 𝗥𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗲𝗿 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 look like?
The graph was inspired by the amazing work of @eugeneyan
Let’s remind ourselves of what a 𝗥𝗲𝗾𝘂𝗲𝘀𝘁-𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲 𝗠𝗼𝗱𝗲𝗹 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 looks like - 𝗧𝗵𝗲 𝗠𝗟𝗢𝗽𝘀 𝗪𝗮𝘆.
🧵
#MLOps #MachineLearning #DataScience #Data
You will find this type of model deployment to be the most popular when it comes to Online Machine Learning Systems.
Let's zoom in:
𝟭: Version Control: Machine Learning Training Pipeline is defined in code, once merged to the main branch it is built and triggered.
👇
Dec 23, 2022 • 9 tweets • 4 min read
If I could only choose 5 books to read in 2023 as an aspiring Data Engineer, these would be the ones, in a specific order:
Read on in the Thread 👇
--------
Follow me and hit 🔔 to 𝗟𝗲𝘃𝗲𝗹 𝗨𝗽 in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space!
1️⃣ ”𝗙𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹𝘀 𝗼𝗳 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴” - A book that I wish I had 5 years ago. After reading it you will understand the entire Data Engineering workflow. It will prepare you for further deep dives.
👇
Dec 22, 2022 • 15 tweets • 5 min read
What is a 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗦𝘁𝗼𝗿𝗲 and why is it such an important element in the 𝗠𝗟𝗢𝗽𝘀 𝗦𝘁𝗮𝗰𝗸?
Find out in the Thread 👇
A Feature Store System sits between Data Engineering and Machine Learning Pipelines, and it solves the following issues:
➡️ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5)
👇
Dec 21, 2022 • 15 tweets • 5 min read
Do you know what CDC (Change Data Capture) is and that there are multiple ways to implement it?
Find out in the Thread 👇
𝗖𝗵𝗮𝗻𝗴𝗲 𝗗𝗮𝘁𝗮 𝗖𝗮𝗽𝘁𝘂𝗿𝗲 is a software process used to replicate actions performed against Operational Databases for use in downstream applications.
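The replication idea can be sketched as replaying a stream of change events against a downstream copy. The event shape (`op`, `key`, `row`) and the in-memory `replica` dict are assumptions made for illustration; how the events are captured from the source database is left out, since that is where the implementation approaches differ.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one captured change event to a downstream key -> row copy."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical ordered change events captured from an operational database.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ann"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob"}},
    {"op": "update", "key": 2, "row": {"name": "Bo"}},
    {"op": "delete", "key": 1},
]

replica = {}
for event in change_log:
    apply_change(replica, event)
```

Applying the events in their original order matters: replaying the same log out of order could leave the replica in a state the source database never had.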