Discover and read the best of Twitter Threads about #dataengineering

Most recents (24)

Troubleshooting slow Spark jobs is a special type of data engineering torture!

What are the common culprits? 1/5
- Inadequate initial file parallelism
If your upstream data tables are written to too few files, you can't increase the parallelism much unless you work with the upstream pipeline to write more files. Writing more files can speed up your job dramatically! 2/5
- Not enough memory/disk spillage
Disk spillage happens when Spark needs more RAM than the executors have and spills the data to disk instead. Disk spillage hurts perf! You can solve it by bumping up executor memory or increasing parallelism. Spillage can also be caused by skew. 3/5
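To make the "increase parallelism" advice a bit more concrete, here is a back-of-the-envelope sketch that estimates how many partitions keep each task's slice of data under a size budget. The function name, the 128 MiB target, and the 500 GiB input are assumptions chosen for illustration, not numbers from the thread.

```python
def estimate_partitions(total_bytes: int, target_partition_bytes: int = 128 * 1024**2) -> int:
    """Return how many partitions are needed so each holds at most the target size."""
    if total_bytes <= 0:
        return 1
    # Integer ceiling division: round up so no partition exceeds the target.
    return -(-total_bytes // target_partition_bytes)

# Hypothetical job reading 500 GiB of input (an assumed figure).
partitions = estimate_partitions(500 * 1024**3)
print(partitions)
```

A number like this could then feed a setting such as `spark.sql.shuffle.partitions`. It will not fix skew, because skewed keys still pile up in a few partitions.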
Read 5 tweets
Testing your pipelines before merging is crucial to ensure they do not fail in production. However, testing data pipelines is complex (and expensive) due to the data size, confidentiality, and the time it takes to test a data pipeline.
🧵
#data #dataengineering #testing #dataops
Here are a few ways to get data for your tests:

1. Copying data: An exact copy of the prod data for testing will ensure that our changes are not breaking the pipeline. This is expensive! You can use a subset of the data for testing instead, accepting that you may miss some edge cases.
2. Data git: Projects like Nessie and LakeFS can help set up different environments without replicating the entire dataset.
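One way to take "a part of data" for tests is deterministic, hash-based sampling, so the same rows land in the test set on every run. This is only a sketch: the `in_sample` helper, the `user_id` key, and the 10% rate are hypothetical choices, not something the thread prescribes.

```python
import hashlib

def in_sample(key: str, percent: int) -> bool:
    """Deterministically keep roughly `percent`% of keys, stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

# Made-up rows standing in for a production table.
rows = [{"user_id": f"user-{i}", "amount": i} for i in range(1000)]
sample = [r for r in rows if in_sample(r["user_id"], 10)]
print(len(sample), "of", len(rows), "rows kept")
```

Sampling on a business key rather than on row position keeps all rows for a sampled user together, which helps when the pipeline joins or aggregates by that key.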
Read 7 tweets
Data engineers work with multiple systems & it's crucial to understand DevOps. Shown below are a few DevOps concepts to familiarize oneself with:

1. Docker: docs.docker.com/get-started/
2. Kubernetes: kubernetes.io/docs/concepts/…
3. CI/CD: resources.github.com/ci-cd/

#dataengineering
#data
That's a wrap!

If you enjoyed this thread:

1. Follow me @startdataeng for more of these
2. RT the tweet below to share this thread with your audience
Read 4 tweets
Things I've learned running my data engineering boot camp for 2 weeks!

- don't use platforms like Maven which take 10%. Google Classroom is free and good enough. 1/5
- be more than 1 day ahead of the boot camp. When building curricula, you should be working on next week's curriculum this week so your students can review the materials beforehand.

- use Postgres 14+ in your SQL sections, otherwise you don't have access to the BIT_COUNT function 2/5
- use 1 long-standing Zoom call instead of 1 Zoom call per session. This makes it easier for students to know where to go. And don't use the Google Calendar + Zoom integration, it's really bad!

- give people more time on the homework. Strict deadlines are for college not boot… twitter.com/i/web/status/1…
Read 5 tweets
How to use ChatGPT to speed up data pipeline dev in a few easy steps:

1. Supply ChatGPT with your input schemas. Paste your CREATE TABLE statements directly. ChatGPT knows how to parse these and make sense of the fields inside

1/4
2. Specify what type of analytical pattern you want to do on this data. ChatGPT understands aggregation (for GROUP BY queries), cumulation (for FULL OUTER JOIN queries), enrichment (for JOIN queries) and slowly changing dimension patterns.

2/4
3. ChatGPT needs more context on WHICH fields it should apply these patterns to. If you don't supply that, it'll give you flaming garbage.

Example: "apply slowly changing dimension type 2 pattern to fields age, gender, and phone os"

3/4
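The three steps above can be sketched as a small prompt builder that always puts the schemas first, then the pattern, then the explicit field list. The function name and the `users` table are hypothetical, invented for this example, and no ChatGPT API call is made here.

```python
def build_pipeline_prompt(create_statements: list[str], pattern: str, fields: list[str]) -> str:
    """Assemble a prompt: input schemas, then the pattern, then the target fields."""
    if not fields:
        # Step 3 of the thread: without explicit fields the request is too vague.
        raise ValueError("name the fields explicitly; the pattern alone is too vague")
    schemas = "\n\n".join(s.strip() for s in create_statements)
    return (
        "Here are my input schemas:\n\n"
        f"{schemas}\n\n"
        f"Apply the {pattern} pattern to fields {', '.join(fields)}."
    )

# Hypothetical schema, made up for illustration.
users_ddl = """
CREATE TABLE users (
    user_id BIGINT,
    age INT,
    gender VARCHAR(16),
    phone_os VARCHAR(32),
    snapshot_date DATE
);
"""
prompt = build_pipeline_prompt(
    [users_ddl], "slowly changing dimension type 2", ["age", "gender", "phone_os"]
)
print(prompt)
```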
Read 4 tweets
My data eng boot camp curriculum has been updated:

- Week 1: Dimension Data Modeling
Day 1: understanding dimensions. Daily dimensions vs SCDs. How to pick SCD type 1/2/3
Day 2: applied dimension data modeling. backfilling SCD tables. Incremental building SCD tables.
1/8
- Week 2: Fact Data Modeling
Day 1: understand facts. denormalized facts vs normalized facts. How to collaborate and get logging right.
Day 2: applied fact data modeling. Reduced facts for efficient long-term analysis. Cumulative table design for efficient fact analysis. 2/8
- Week 3: Spark
Day 1: understanding when to use Spark. Deep dive into Spark architecture and bottlenecks
Day 2: applied Spark.
Understanding the parallelism vs network overhead trade-off. SparkSQL vs DataFrame vs Dataset.
I'll lead Scala Spark and @ADutchEngineer will lead PySpark
3/8
Read 8 tweets
Live from #GartnerDA | 5 Ways to Enhance Your Data Engineering Practices with Robert Thanaraj, Gartner Director Analyst: gtnr.it/3JOkYPF
About this session: Analytics relies on a successful data foundation; it must be backed with the right data and processes. Robert explores 5 ways to enhance your #DataEngineering practices: gtnr.it/3JOkYPF #GartnerDA
#DataEngineering is a critical skill in high demand amongst employers with a 7.5% increase in demand. #GartnerDA
Read 13 tweets
" SQL Puzzle Interview Question "

🧵
Table script:

-- Each row holds an id, a formula string, and a value
create table input (
id int,
formula varchar(10),
value int
);
insert into input values (1,'1+4',10),(2,'2+1',5),(3,'3-2',40),(4,'4-1',20);
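The preview cuts off before the puzzle statement, so the task below is an assumption: a common reading of this setup is that each digit in `formula` refers to an `id`, and the goal is to evaluate the formula using those rows' `value`s. Under that assumption, and assuming single-digit ids with only `+` and `-`, a self-join sketch run through Python's `sqlite3` could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table input (id int, formula varchar(10), value int);
insert into input values (1,'1+4',10),(2,'2+1',5),(3,'3-2',40),(4,'4-1',20);
""")

# ASSUMPTION: formula is always '<id><op><id>' with single-digit ids and op in (+, -).
query = """
select i.id,
       i.formula,
       case substr(i.formula, 2, 1)
            when '+' then l.value + r.value
            else l.value - r.value
       end as result
from input i
join input l on l.id = cast(substr(i.formula, 1, 1) as int)  -- left operand row
join input r on r.id = cast(substr(i.formula, 3, 1) as int)  -- right operand row
order by i.id
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

For row 1, `1+4` resolves to the values of ids 1 and 4, that is 10 + 20 = 30.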
Read 4 tweets
" Exploratory Data Analysis on Terrorism "

🧵
We performed Exploratory Data Analysis on a terrorism #dataset to find the hot zones of #terrorism. #EDA is nothing but #analyzing the given data, finding the #trends and patterns, and making some assumptions. #DataVisualization #DataScience #MachineLearning
In this #dataset, there are many features including countries, states, regions, gang names, weapon types, target types, years, months, days, and many more features.
Read 8 tweets
What is a correct Data Engineering Learning Path?

My thoughts in the 🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
I believe that the following is a correct order to start in Your Data Engineering Path:

👇
➡️ Understand Basic Processes:

👉 Data Extraction
👉 Data Validation
👉 Data Contracts
👉 Loading Data into a DWH / Data Lake
👉 Transformations in a DWH / Data Lake
👉 Scheduling

👇
Read 8 tweets
What are the basics of Writing Data to a Kafka Topic?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Kafka is an extremely important Distributed Messaging System to understand as it was the first of its kind and most of the new products are built on the ideas of Kafka.

Some general definitions:

👇
➡️ Clients writing to Kafka are called Producers.
➡️ Clients reading the Data are called Consumers.
➡️ Data is written into Topics that can be compared to Tables in Databases.

👇
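The three definitions above can be illustrated with a toy, in-memory model. This is not the Kafka client API: `ToyBroker`, `ToyProducer`, and `ToyConsumer` are hypothetical classes, and the per-consumer offset tracking is an added assumption about how a reader keeps its place in a topic.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only list."""
    def __init__(self):
        self._topics = defaultdict(list)

    def append(self, topic, message):
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1  # offset of the new message

    def read(self, topic, offset):
        return self._topics[topic][offset:]

class ToyProducer:
    """Client that writes messages to a topic."""
    def __init__(self, broker):
        self._broker = broker

    def send(self, topic, message):
        return self._broker.append(topic, message)

class ToyConsumer:
    """Client that reads a topic and remembers how far it has read."""
    def __init__(self, broker):
        self._broker = broker
        self._offsets = defaultdict(int)

    def poll(self, topic):
        messages = self._broker.read(topic, self._offsets[topic])
        self._offsets[topic] += len(messages)
        return messages

broker = ToyBroker()
producer = ToyProducer(broker)
consumer = ToyConsumer(broker)
producer.send("page_views", {"user": "a", "url": "/home"})
producer.send("page_views", {"user": "b", "url": "/pricing"})
first_poll = consumer.poll("page_views")   # both messages
second_poll = consumer.poll("page_views")  # nothing new since the last poll
print(first_poll, second_poll)
```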
Read 8 tweets
Quick guide to go from 0 to #dataengineering hero:

- learn SQL
Data Lemur is a great resource here

- learn Python
Do like… 30-40 LeetCode easy and medium questions

- distributed compute
Get a trial of Databricks or Snowflake and find a training to learn about it

1/3
- data modeling
Find a dimension table like users that you can snapshot daily. Learn about slowly changing dimensions.
Find a fact/event table that you can aggregate and learn about fact modeling

- job orchestration
Learn Mage or Airflow to do your daily automated tasks
2/3
- data story telling
Take a training by Tableau on data visualization and how to tell stories with data

- communication
Read the books Crucial Conversations and Radical Candor. They'll help a lot!

If you just do this, you'll be a lot closer to a great data engineering job! 3/3
Read 4 tweets
So what is the difference between Row Based and Column Based file formats?

🧵

#Data #DataEngineering #MLOps #MachineLearning
Row Based:

➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently since you can write the entire row at once.

👇
➡️ For select statements that target a subset of columns, reading is slower since you need to scan all sets of rows to retrieve one of the columns.

👇
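A tiny in-memory sketch can make the row-based trade-off visible, with a column layout shown alongside for contrast. The sample records are made up, and Python lists only stand in for bytes on disk, so treat this as an analogy and not as how any particular file format is implemented.

```python
rows = [
    {"id": 1, "country": "LT", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "US", "amount": 7.25},
]

# Row-based layout: each record is stored contiguously, so an append is one write.
row_store = [tuple(r.values()) for r in rows]
row_store.append((4, "DE", 3.0))

# Column-based layout: each column is stored contiguously.
column_store = {name: [r[name] for r in rows] for name in rows[0]}
for name, value in zip(column_store, (4, "DE", 3.0)):
    column_store[name].append(value)  # an append touches every column

# SELECT sum(amount): the row layout walks every full record...
total_from_rows = sum(record[2] for record in row_store)
# ...while the column layout reads only the one column it needs.
total_from_columns = sum(column_store["amount"])
print(total_from_rows, total_from_columns)
```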
Read 8 tweets
What are the main use cases for Apache Kafka or any other Distributed Messaging System?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
We have covered lots of concepts around Kafka already. But what are the most common use cases for The System that you are very likely to run into as a Data Engineer?

Let's take a closer look:

👇
Website Activity Tracking.

➡️ The Original use case for Kafka by LinkedIn.
➡️ Events happening in the website like page views, conversions etc. are sent via a Gateway and piped to Kafka Topics.

👇
Read 12 tweets
Roadmap to becoming a Data Analyst in three months, absolutely free. No need to pay a penny for this.

I have put together a roadmap with free resources.

A thread 🧵👇
1. First Month Foundations of Data Analysis

A. Corey Schafer - Python Tutorials for Beginners:
B. StatQuest with Josh Starmer - Statistics Fundamentals:
C. Ken Jee - Data Analysis with Python
2. Second Month - Advanced Data Analysis Techniques

A. Sentdex - Machine Learning with Python
B. StatQuest with Josh Starmer - Machine Learning Fundamentals
C. Brandon Foltz - Business Analytics
Read 6 tweets
Python project ideas for beginners with source code

A thread 🧵👇
1. Calculator App
Source Code Link: github.com/programiz/Calc…
2. Expense Tracker
Source Code Link: github.com/prtm/Expense-T…
Read 7 tweets
Python for data science beginners roadmap

A thread 🧵👇
1. Python Basics
Codecademy's Python Course (codecademy.com/learn/learn-py…)
Python for Everybody Course (py4e.com)
2. Data Analysis Libraries
NumPy User Guide (numpy.org/doc/stable/use…)
Pandas User Guide (pandas.pydata.org/docs/user_guid…)
Matplotlib Tutorials (matplotlib.org/stable/tutoria…)
Read 7 tweets
How should you learn Power BI? 🚀

🧵
It's easy to be overwhelmed by how broad #PowerBI is 😖

If you're starting out, here's the path I recommend ⬇️
📊 Data Modeling
Begin by learning to organize data into tables, create relationships, and add calculated columns and measures.

This is the most important part of your journey, as understanding the #data is always the first step in Power BI development.
Read 13 tweets
▶️ Practice Writing SQL Queries using Real Dataset 💯

🧵
"The very first thing we must do when writing #SQL queries is to understand the underlying data. Once we understand the data and how it is stored across different tables, it becomes much simpler to write SQL #Queries to retrieve any information from that data."
✅ List of SQL Queries:

We shall write SQL #Queries using this data. For each of these queries, you will find the problem statement and then a screenshot of the expected output. Under each of these 20 problem statement
Read 9 tweets
Considering switching to an MLOps Engineer role?

My thoughts in the 🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Usually MLOps Engineers are professionals tasked with building out the ML Platform in the organization.

👇
This means that the skill set required is very broad. Naturally, very few people start off with the full set of skills you would need to brand yourself as an MLOps Engineer. This is why I would not choose this role if you are just entering the market.

👇
Read 10 tweets
What is the difference between Splittable and Non-Splittable Files?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
You are very likely to run into a Distributed Compute System or Framework in your career. It could be Spark, Hive, Presto or any other.

👇
Also, it is very likely that these Frameworks would be reading data from a distributed storage. It could be HDFS, S3 etc.

👇
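The preview stops before the thread defines the terms, so the following is a sketch under a stated assumption: a file is "splittable" when several workers can each read a byte range independently. For newline-delimited text this works if every line is owned by the range in which it starts. The `read_split` helper and the sample data are hypothetical.

```python
def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines owned by byte range [start, end).

    A line belongs to the split in which its first byte falls.
    """
    pos = start
    if start > 0:
        # Find the first line that starts at or after `start`; a line already in
        # progress at `start` belongs to the previous split.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])  # may read past `end` to finish the line
        pos = stop + 1
    return lines

data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
split_size = 7  # arbitrary split size for the demo
splits = [
    read_split(data, start, min(start + split_size, len(data)))
    for start in range(0, len(data), split_size)
]
print(splits)
```

Each line is read by exactly one split even though the split boundaries fall in the middle of lines. A stream compressed as a single gzip member cannot be treated this way, since decoding generally has to begin at the start of the stream.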
Read 12 tweets
So how do we implement a Production Grade Batch Machine Learning Model Pipeline in The MLOps Way?

🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Let's zoom in:

1: Everything starts in version control: the Machine Learning Training Pipeline is defined in code; once merged to the main branch, it is built and triggered.

👇
2: Feature preprocessing stage: Features are retrieved from the Feature Store, validated, and passed to the next stage. Any feature-related metadata is saved to an Experiment Tracking System.

👇
Read 13 tweets
How do we Decompose Real Time Machine Learning Service Latency and why should you care to understand the pieces as an ML Engineer?

Find out in the 🧵

#Data #DataEngineering #MLOps #MachineLearning #DataScience
Usually, what the users of your Machine Learning Service care about is the total endpoint latency: the time between when a request is made against the Service (1.) and when the response is received (6.).

👇
Certain SLAs will be established on what the acceptable latency is and you will need to meet them. Being able to decompose the total latency is even more important, as you can improve each piece independently. Let's see how.

👇
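One lightweight way to decompose total latency is to time each stage of the request handler separately. The stage names below (`feature_lookup`, `inference`, `postprocessing`) are hypothetical placeholders, since the thread's numbered diagram is not part of this preview, and the stage bodies are dummy computations.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

def handle_request(payload):
    with timed("total"):
        with timed("feature_lookup"):
            features = [float(x) for x in payload]  # placeholder for a feature store call
        with timed("inference"):
            score = sum(features) / len(features)   # placeholder for a model call
        with timed("postprocessing"):
            response = {"score": round(score, 3)}
    return response

response = handle_request([1, 2, 3])
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.3f} ms")
```

With per-stage numbers in hand, each piece can be compared against its share of the SLA and tuned on its own.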
Read 13 tweets
The data engineering ecosystem is evolving at breakneck speed. It's time to level up and explore what lies beyond numpy, pandas, and matplotlib.

🐍 Starting a Pythonic thread

🧵 [1/x]

#python #dataengineering
Redpanda 🐼 : redpanda.com

Redpanda offers better performance than Apache Kafka while maintaining compatibility with its API.

Will it be as powerful as Google PubSub?

🧵 [2/x]

#python #dataengineering
DuckDB 🦆 : duckdb.org

DuckDB lets us run OLAP from our web browser and gives us an engine that works quite well with Parquet. MotherDuck (motherduck.com) is looking to offer DuckDB as SaaS at large scale.

🧵 [3/x]

#python #dataengineering
Read 8 tweets
