- learn Python
Do around 30-40 LeetCode easy and medium questions
- distributed compute
Get a trial of Databricks or Snowflake and find a training to learn about it
1/3
- data modeling
Find a dimension table like users that you can snapshot daily. Learn about slowly changing dimensions.
Find a fact/event table that you can aggregate and learn about fact modeling
- job orchestration
Learn Mage or Airflow to do your daily automated tasks
2/3
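One way to picture the slowly changing dimension idea from the data modeling step is the sketch below. It collapses daily snapshots of a users table into "type 2" history rows, each with a start and end date. The function name `build_scd2`, the `country` attribute, and the snapshot layout are all made up for illustration, and users who disappear from a snapshot are not handled.

```python
from datetime import date


def build_scd2(snapshots):
    """Collapse daily snapshots into SCD type 2 style history rows.

    `snapshots` maps a day to {user_id: country}. This layout is an
    assumption for the example, not a standard format.
    """
    rows = []        # every version of every user, open or closed
    open_rows = {}   # user_id -> index in `rows` of the current version
    for day in sorted(snapshots):
        for user_id, country in snapshots[day].items():
            idx = open_rows.get(user_id)
            if idx is not None and rows[idx]["country"] == country:
                continue  # nothing changed, keep the current version open
            if idx is not None:
                rows[idx]["end_date"] = day  # close the old version
            rows.append({
                "user_id": user_id,
                "country": country,
                "start_date": day,
                "end_date": None,  # None marks the current version
            })
            open_rows[user_id] = len(rows) - 1
    return rows


snapshots = {
    date(2024, 1, 1): {1: "US", 2: "CA"},
    date(2024, 1, 2): {1: "US", 2: "MX"},  # user 2 moved
    date(2024, 1, 3): {1: "DE", 2: "MX"},  # user 1 moved
}
history = build_scd2(snapshots)
for row in history:
    print(row)
```

Three days of snapshots with two users become four history rows instead of six snapshot rows, which is the main appeal once the table is large and changes are rare.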
- data storytelling
Take a Tableau training on data visualization and how to tell stories with data
- communication
Read the books Crucial Conversations and Radical Candor. They’ll help a lot!
If you just do this, you’ll be a lot closer to a great data engineering job! 3/3
If you want a more structured approach, I’ll be launching a boot camp in April. Subscribe to my newsletter zachwilson.tech to get updates about it!
Changes to master data impact many more decisions. These changes need to be communicated effectively. If they aren’t, people will find discrepancies and lose trust in your data. This causes slower decisions and impacts revenue. 2/5
Master data needs the highest quality possible. What does this look like?
- the pipeline has comprehensive data quality checks that stop data flow when they’re violated
- the pipeline uses the write-audit-publish pattern to minimize bad data leaking into production
3/5
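A minimal sketch of write-audit-publish, assuming an in-memory dict stands in for the warehouse. The table names, the `audit` checks, and the `DataQualityError` class are all hypothetical. In a real pipeline the staging and production tables would live in your warehouse and the checks would come from whatever data quality tooling you use.

```python
class DataQualityError(Exception):
    """Raised when an audit check fails, which stops the data flow."""


def audit(rows):
    """Return the names of the checks that the staged rows violate."""
    failures = []
    if not rows:
        failures.append("non_empty")
    ids = [row.get("user_id") for row in rows]
    if any(user_id is None for user_id in ids):
        failures.append("user_id_not_null")
    if len(ids) != len(set(ids)):
        failures.append("user_id_unique")
    return failures


def write_audit_publish(tables, name, rows):
    staging = f"{name}__staging"
    tables[staging] = list(rows)         # write: land data in staging only
    failures = audit(tables[staging])    # audit: run checks on staging
    if failures:
        # Production is untouched; staging is kept around for debugging.
        raise DataQualityError(f"{staging} failed checks: {failures}")
    tables[name] = tables.pop(staging)   # publish: swap into production


tables = {}
write_audit_publish(tables, "users", [{"user_id": 1}, {"user_id": 2}])
```

The point of the ordering is that consumers only ever read the production table, so a failed audit costs you freshness rather than trust.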
Ad-hoc SQL queries and SQL queries running in production will generally look different. Copying the data scientist’s query into Airflow isn’t enough to be “production ready.”
Here are some things to look for in ad-hoc queries that should be changed before moving to production. 1/4
1. GROUP BY 1,2,3… / ORDER BY 1,2,3
This is used to speed up writing ad-hoc queries. Please spell out the column names in production.
2. SELECT *
This grabs all the columns quickly for ad-hoc queries. Please list the columns explicitly in production.
2/4
3. LIMIT 1000
Limits should generally be removed when moving to production since you want the entire data set.
4. Subqueries
Subqueries should almost always be abstracted as CTEs when running in production.
3/4
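To make the four points above concrete, here is a small sketch that runs an ad-hoc style query and a production style rewrite against an in-memory SQLite database. The `events` table, its columns, and the sample rows are invented for the example, and SQLite is only a stand-in for whatever engine you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO events (event_id, country) VALUES (?, ?)",
    [(1, "US"), (2, "US"), (3, "US"), (4, "CA"), (5, "CA"), (6, "MX")],
)

# Ad-hoc style: SELECT *, positional GROUP BY / ORDER BY, a subquery, a LIMIT.
ADHOC_SQL = """
SELECT *
FROM (SELECT country, COUNT(*) FROM events GROUP BY 1)
ORDER BY 2 DESC
LIMIT 1000
"""

# Production style: named columns, a CTE instead of a subquery, no LIMIT.
PRODUCTION_SQL = """
WITH events_per_country AS (
    SELECT country, COUNT(*) AS event_count
    FROM events
    GROUP BY country
)
SELECT country, event_count
FROM events_per_country
ORDER BY event_count DESC
"""

adhoc_rows = conn.execute(ADHOC_SQL).fetchall()
prod_rows = conn.execute(PRODUCTION_SQL).fetchall()
print(adhoc_rows)
print(prod_rows)
```

On this tiny table both queries return the same rows. The production version keeps working if someone adds or reorders columns, and it stops silently truncating once the result grows past 1000 rows.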
Understanding the data value chain helps you be a much more effective data professional.
The steps usually are:
- data generation
- data processing
- data validation
- data analytics
- machine learning predictions
I’ll explain each step in this thread: 1/6
Data generation is owned by software engineers or data engineers.
If the data source is your app or website, SWEs should set up quality logging to get the data flow started.
If the data source is a third-party API, DEs should set up an ingestion pipeline with quality checks. 2/6
Data processing is owned by data engineers. This is the world of data modeling and master data. DEs should create robust pipelines here with ample quality checks.
Using patterns like write-audit-publish here can be very powerful to increase quality. 3/6
Starting out in the data field can be overwhelming. Should you be a data scientist? A data engineer? A data analyst? An ML engineer? The number of role options is overwhelming!
Here is some high-level guidance on how to pick between some of these roles.
1/5
You should become a data analyst if:
You like to investigate business problems. You like digging into the data like Sherlock Holmes and finding patterns that have business impact. You're fascinated by data visualization and building reports. 2/5
You should become a data scientist if:
You really like statistics. You like setting up experiments to see how different experiences impact user behavior. You have a knack for machine learning and can talk about the results from ML algorithms to less technical people. 3/5
How I went from junior data engineer (L3) at Facebook to staff data engineer (L6) at Airbnb in 4 years.
- I got hired at Facebook in 2016 as a junior data engineer. I had 2 years of experience and I realized that I probably got hired at the wrong level. (1/13)
- Instead of getting bitter about it, I decided to show that I was an L4 DE.
- As a junior DE, I worked on notifications at FB. I created a metric called reachability, which asks: "can Facebook reach you?" This was a good counter-metric for growth impact, which can be gamed with spam (2/13)
- This impact was rated "greatly exceeding expectations" at my review near my one-year anniversary. I got promoted from L3 to L4 after this.
- I felt determined to go from mid-level to senior in a year as well. So I started working long hours on many different business areas. (3/13)