Zach Wilson
Founder @ https://t.co/CWvLDHU2Lx $1m ARR | ADHD | 800k+ followers on all platforms | 10 yrs DE experience | ex @facebook, @netflix, and @airbnb
Jun 1, 2023 5 tweets 2 min read
Troubleshooting slow Spark jobs is a special type of data engineering torture!

What are the common culprits? 1/5

- Inadequate initial file parallelism
If your upstream data tables are written to too few files, you can't increase the parallelism much unless you work with the upstream pipeline to write more files. Writing more files can increase the speed of your job dramatically! 2/5
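A back-of-the-envelope sketch of why file count caps parallelism. It assumes each input file is read by exactly one task (the case for non-splittable inputs such as gzip-compressed text; splittable formats can fare better). The function name and all numbers are illustrative, not Spark APIs.

```python
import math


def read_stage_estimate(num_files: int, total_cores: int, seconds_per_file: float) -> dict:
    """Estimate the read stage, ASSUMING one task per input file."""
    tasks = num_files
    busy_cores = min(tasks, total_cores)       # cores that actually get work
    waves = math.ceil(tasks / total_cores)     # rounds of tasks needed
    return {
        "tasks": tasks,
        "busy_cores": busy_cores,
        "waves": waves,
        "wall_clock_seconds": waves * seconds_per_file,
    }


# The same 1,000 seconds of read work, written as 4 big files vs 200 small ones.
few = read_stage_estimate(num_files=4, total_cores=200, seconds_per_file=250.0)
many = read_stage_estimate(num_files=200, total_cores=200, seconds_per_file=5.0)
print(few)
print(many)
```

Under these assumptions, the 4-file layout leaves 196 of the 200 cores idle during the read, no matter how large the cluster is.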
May 1, 2023 5 tweets 2 min read
Things I’ve learned running my data engineering boot camp for 2 weeks!

- don’t use platforms like Maven which take 10%. Google Classroom is free and good enough. 1/5

- be more than 1 day ahead of the boot camp. When building curricula, you should be working on next week’s curriculum this week so your students can review the materials beforehand.

- use Postgres 14+ for your SQL sections, otherwise you don’t have access to the BIT_COUNT function 2/5
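For readers who haven't met it, a function like BIT_COUNT returns the number of 1-bits in a value. Here is a minimal Python sketch of that computation. The "activity bitmask" layout is a hypothetical use case for illustration, not something stated in the thread.

```python
def bit_count(value: int) -> int:
    """Count the 1-bits in a non-negative integer."""
    if value < 0:
        raise ValueError("expects a non-negative integer")
    return bin(value).count("1")


# Hypothetical layout: bit i is set if the user was active i days ago.
last_7_days = 0b1011001
active_days = bit_count(last_7_days)
print(active_days)  # 4
```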
Apr 14, 2023 4 tweets 1 min read
How to use ChatGPT to speed up data pipeline dev in a few easy steps:

1. Supply ChatGPT with your input schemas. Paste your CREATE TABLE statements directly. ChatGPT knows how to parse these and make sense of the fields inside

1/4
2. Specify what type of analytical pattern you want to do on this data. ChatGPT understands aggregation (for GROUP BY queries), cumulation (for FULL OUTER JOIN queries), enrichment (for JOIN queries) and slowly changing dimension patterns.

2/4
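To make the cumulation pattern concrete, here is a small Python sketch of the FULL OUTER JOIN idea: yesterday's cumulative state is merged with today's rows, and keys from either side survive. The table shape (user id mapped to a list of active dates) and the function name are assumptions for illustration.

```python
def cumulate(yesterday: dict[str, list[str]], today: dict[str, str]) -> dict[str, list[str]]:
    """Merge yesterday's cumulative state with today's rows, keyed by user id.

    Mirrors FULL OUTER JOIN semantics: a key present on either side is kept.
    """
    merged = {}
    for user_id in sorted(yesterday.keys() | today.keys()):
        history = list(yesterday.get(user_id, []))  # copy; empty for new users
        if user_id in today:
            history.append(today[user_id])
        merged[user_id] = history
    return merged


yesterday = {"alice": ["2023-04-12"], "bob": ["2023-04-11", "2023-04-12"]}
today = {"bob": "2023-04-13", "carol": "2023-04-13"}
cumulated = cumulate(yesterday, today)
print(cumulated)
```

Here alice (only in yesterday's state) keeps her history, bob (on both sides) gets a new date appended, and carol (only in today's data) starts a new history.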
Apr 6, 2023 8 tweets 2 min read
My data eng boot camp curriculum has been updated:

- Week 1: Dimension Data Modeling
Day 1: understanding dimensions. Daily dimensions vs SCDs. How to pick SCD type 1/2/3
Day 2: applied dimension data modeling. backfilling SCD tables. Incrementally building SCD tables.
1/8
- Week 2: Fact Data Modeling
Day 1: understand facts. denormalized facts vs normalized facts. How to collaborate and get logging right.
Day 2: applied fact data modeling. Reduced facts for efficient long-term analysis. Cumulative table design for efficient fact analysis. 2/8
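As a rough illustration of the Week 1, Day 2 topic (backfilling an SCD table), the sketch below collapses daily dimension snapshots into type 2 style date ranges. It assumes every key appears in every daily snapshot (no gaps) and uses ISO date strings. The row shape and names are hypothetical, not taken from the boot camp material.

```python
from dataclasses import dataclass


@dataclass
class ScdRow:
    key: str
    value: str
    start_date: str
    end_date: str


def build_scd2(snapshots: list[tuple[str, str, str]]) -> list[ScdRow]:
    """Collapse (date, key, value) daily snapshots into type 2 style ranges."""
    rows: list[ScdRow] = []
    open_rows: dict[str, ScdRow] = {}  # latest range per key
    for date, key, value in sorted(snapshots):  # ISO dates sort chronologically
        current = open_rows.get(key)
        if current is not None and current.value == value:
            current.end_date = date  # value unchanged: extend the open range
        else:
            current = ScdRow(key, value, date, date)  # value changed: new range
            rows.append(current)
            open_rows[key] = current
    return rows


snapshots = [
    ("2023-04-01", "u1", "free"),
    ("2023-04-02", "u1", "free"),
    ("2023-04-03", "u1", "paid"),
    ("2023-04-01", "u2", "paid"),
]
for row in build_scd2(snapshots):
    print(row)
```

Four snapshot rows become three ranges: two unchanged days for u1 are stored once, which is where the storage savings of an SCD come from.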
Mar 16, 2023 4 tweets 2 min read
Quick guide to go from 0 to #dataengineering hero:

- learn SQL
Data Lemur is a great resource here

- learn Python
Do like… 30-40 leetcode easy and medium questions

- distributed compute
Get a trial of Databricks or Snowflake and find a training to learn about it

1/3
- data modeling
Find a dimension table like users that you can snapshot daily. Learn about slowly changing dimensions.
Find a fact/event table that you can aggregate and learn about fact modeling

- job orchestration
Learn Mage or Airflow to do your daily automated tasks
2/3
Nov 5, 2022 5 tweets 1 min read
Senior data engineers often own master data sets. These are highly trusted data sets used by many people in the company.

Master data is handled and changed differently than regular data or raw data.

Here’s how master data is different in #dataengineering: 1/5

Changes to master data impact many more decisions. These changes need to be communicated effectively. If they aren’t, people will find discrepancies and lose trust in your data. This causes slower decisions and impacts revenue. 2/5
Oct 26, 2022 4 tweets 1 min read
Ad-hoc SQL queries and SQL queries running in production will generally look different. Copying the data scientist’s query into Airflow isn’t enough to be “production ready”.

Here are some things to look for in ad-hoc queries that should be changed before moving to production. 1/4

1. GROUP BY 1,2,3… / ORDER BY 1,2,3
This is used to speed up writing ad-hoc queries. Please spell out the column names in production.

2. SELECT *
This grabs all the columns quickly for ad-hoc queries. Please list the columns explicitly in production.
2/4
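A small runnable sketch of point 1, using Python's built-in sqlite3 module. The events table and its columns are made up for illustration. Both queries return the same rows today, but the explicit version keeps its meaning if someone later reorders the SELECT list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "US", 10.0), ("u2", "US", 5.0), ("u3", "DE", 7.0)],
)

# Ad-hoc style: positional references, quick to type.
adhoc = "SELECT country, SUM(amount) FROM events GROUP BY 1 ORDER BY 1"

# Production style: named columns and an explicit alias.
production = """
    SELECT country, SUM(amount) AS total_amount
    FROM events
    GROUP BY country
    ORDER BY country
"""

adhoc_rows = conn.execute(adhoc).fetchall()
production_rows = conn.execute(production).fetchall()
print(production_rows)  # [('DE', 7.0), ('US', 15.0)]
```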
Oct 25, 2022 6 tweets 1 min read
Understanding the data value chain helps you be a much more effective data professional.
The steps usually are:

- data generation
- data processing
- data validation
- data analytics
- machine learning predictions

I’ll explain each step in this thread: 1/6

Data generation is owned by software engineers or data engineers.

If the data source is your app or website, SWEs should set up quality logging to get the data flow started.

If the data source is a 3rd-party API, DEs should set up an ingestion pipeline with quality checks. 2/6
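What "quality checks" on an ingestion pipeline might look like, as a minimal sketch. The record fields (id, event_time, amount) and the rules are hypothetical, since the thread does not specify any. The idea is just to separate passing records from failing ones before loading.

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in ("id", "event_time", "amount"):  # hypothetical required fields
        if record.get(field) is None:
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("negative amount")
    return problems


def split_batch(batch: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (good, bad) records based on validate_record."""
    good, bad = [], []
    for record in batch:
        (bad if validate_record(record) else good).append(record)
    return good, bad


batch = [
    {"id": 1, "event_time": "2022-10-25T00:00:00Z", "amount": 12.5},
    {"id": 2, "event_time": None, "amount": 3.0},
    {"id": 3, "event_time": "2022-10-25T00:05:00Z", "amount": -1.0},
]
good, bad = split_batch(batch)
print(len(good), len(bad))  # 1 2
```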
Oct 24, 2022 5 tweets 1 min read
Starting out in the data field can be overwhelming. Should you be a data scientist? A data engineer? A data analyst? An ML engineer? The number of role options is overwhelming!

Here's some high-level guidance on how to pick between some of these roles.
1/5
You should become a data analyst if:
You like to investigate business problems. You like digging into the data like Sherlock Holmes and finding patterns that have business impact. You're fascinated by data visualization and building reports. 2/5
Oct 22, 2022 13 tweets 3 min read
How I went from junior data engineer (L3) at Facebook to staff data engineer (L6) at Airbnb in 4 years.

- I got hired at Facebook in 2016 as a junior data engineer. I had 2 years of experience and I realized that I probably got hired at the wrong level. (1/13)

- Instead of getting bitter about it, I decided to show that I was an L4 DE.
- As a jr DE, I worked on notifications at FB. I created a metric called reachability, which is: "can Facebook reach you?" This was a good counter metric for growth impact, which can be gamed with spam (2/13)