Zach Wilson
Apr 6 · 8 tweets · 2 min read
My data eng boot camp curriculum has been updated:

- Week 1: Dimension Data Modeling
Day 1: understanding dimensions. Daily dimensions vs slowly changing dimensions (SCDs). How to pick between SCD types 1, 2, and 3.
Day 2: applied dimension data modeling. Backfilling SCD tables. Incrementally building SCD tables.
1/8
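A minimal sketch of the SCD Type 2 merge covered on Day 2, in plain Python (the table schema, the `plan` attribute, and the dates are hypothetical — a real pipeline would do this in SQL or Spark):

```python
from datetime import date

# Toy SCD Type 2 merge: when a tracked attribute changes, close out the old
# version of the row and open a new current version.
# Hypothetical schema: user_id, plan, start_date, end_date, is_current.
def scd2_merge(history, snapshot, snapshot_date):
    """history: list of SCD rows; snapshot: dict of user_id -> plan today."""
    current = {r["user_id"]: r for r in history if r["is_current"]}
    out = [r for r in history if not r["is_current"]]  # closed rows kept as-is
    for uid, row in current.items():
        new_plan = snapshot.get(uid)
        if new_plan is not None and new_plan != row["plan"]:
            # Attribute changed: close the old version...
            out.append({**row, "end_date": snapshot_date, "is_current": False})
            # ...and open a new current version effective today.
            out.append({"user_id": uid, "plan": new_plan,
                        "start_date": snapshot_date, "end_date": None,
                        "is_current": True})
        else:
            out.append(row)  # unchanged (or missing from today's snapshot)
    for uid, plan in snapshot.items():
        if uid not in current:  # brand-new user: first version
            out.append({"user_id": uid, "plan": plan,
                        "start_date": snapshot_date, "end_date": None,
                        "is_current": True})
    return out
```

Compared to a Type 1 table (overwrite in place), this keeps the full change history, which is exactly what backfilling and incremental builds have to preserve.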
- Week 2: Fact Data Modeling
Day 1: understanding facts. Denormalized facts vs normalized facts. How to collaborate and get logging right.
Day 2: applied fact data modeling. Reduced facts for efficient long-term analysis. Cumulative table design for efficient fact analysis. 2/8
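The cumulative table idea from Day 2 can be sketched in a few lines — merge yesterday's cumulative metrics with today's daily facts (the FULL OUTER JOIN pattern, here in plain Python with a made-up lifetime-count metric):

```python
# Toy cumulative table update: carry all of history forward and add one
# day of new facts on top. In SQL this is yesterday FULL OUTER JOIN today.
def cumulate(yesterday, today):
    """yesterday/today: dicts of user_id -> count; returns new cumulative."""
    out = dict(yesterday)               # carry forward every user's history
    for uid, n in today.items():
        out[uid] = out.get(uid, 0) + n  # add today's activity (new users too)
    return out
```

The payoff: each daily run scans yesterday's cumulative table plus one day of facts, instead of re-aggregating the entire event history.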
- Week 3: Spark
Day 1: understanding when to use Spark. Deep dive into Spark architecture and bottlenecks
Day 2: applied Spark.
Understanding the parallelism vs network overhead trade-off. SparkSQL vs DataFrame vs Dataset.
I’ll lead Scala Spark and @ADutchEngineer will lead PySpark.
3/8
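One way to build intuition for the Day 2 parallelism-vs-overhead trade-off is a toy cost model (all constants here are illustrative, not measured — real Spark tuning depends on shuffle sizes, skew, and cluster config):

```python
# Toy cost model for choosing a partition count: more partitions mean more
# parallelism, but each partition also pays fixed scheduling/network overhead.
def job_time(total_work, partitions, workers, overhead_per_partition=0.5):
    waves = -(-partitions // workers)       # ceil division: rounds of tasks
    per_task = total_work / partitions      # work done by each partition
    return waves * (per_task + overhead_per_partition)
```

With 10 workers, 10 partitions of a 1000-unit job finish in one wave; 1000 tiny partitions run 100 waves of mostly overhead, and 2 huge partitions leave 8 workers idle — both extremes lose.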
- Week 4: Flink
- Day 1: understanding Flink. When to choose streaming vs batch. Deep dive into how Kafka works.
- Day 2: applied Flink.
Creating sessions to understand user behavior in real time. Learning about windows, sinks, and how to build real-time analytics. 4/8
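The sessionization exercise from Day 2 boils down to what a Flink session window does: split a user's events wherever the inactivity gap exceeds a threshold. A batch-style sketch in plain Python (timestamps in seconds, 30-minute gap assumed):

```python
# Toy sessionization: group one user's event timestamps into sessions
# separated by an inactivity gap (what a Flink session window computes
# incrementally over a stream).
def sessionize(timestamps, gap=30 * 60):
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)    # gap exceeded: close this session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)        # close the final open session
    return sessions
```

The streaming version is the hard (and interesting) part: Flink has to decide when a session is safely closed using watermarks, since late events can still arrive.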
- Week 5: Data Quality
Day 1: proactive vs reactive data quality. Designing idempotent pipelines that are unit tested in Spark. Catching errors in development vs in production.
Day 2: production data quality checks deep dive. write-audit-publish pattern vs signal table pattern. 5/8
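The write-audit-publish pattern from Day 2, reduced to its skeleton (the quality checks and in-memory "tables" are illustrative — in practice staging and production are separate tables or partitions):

```python
# Toy write-audit-publish: write new data to staging, run quality checks
# against it, and only swap it into production if every check passes.
def write_audit_publish(rows, prod, checks):
    staging = list(rows)                         # WRITE: staging, not prod
    failures = [name for name, check in checks.items()
                if not check(staging)]           # AUDIT the staged data
    if failures:
        return prod, failures                    # audit failed: prod untouched
    return staging, []                           # PUBLISH: swap staging in

# Illustrative checks; real ones would cover row counts, nulls, duplicates...
checks = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_null_ids": lambda rows: all(r.get("id") is not None for r in rows),
}
```

The key property is the failure path: bad data never reaches consumers, which is the difference versus auditing production after the fact.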
- Week 6: Data impact and storytelling
Day 1: learning how to make the case for a new data set to be created. How to evangelize the data sets you’ve already created.
Day 2: data viz deep dive. How to build fast viz with pre-computed aggregates, GROUPING SETS, and good design. 6/8
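The pre-computed aggregate idea behind GROUPING SETS, simulated in plain Python (dimension names and the `clicks` measure are made up): one pass over the raw events produces a single rollup table holding several grains at once, so dashboards read tiny pre-aggregated rows instead of scanning events.

```python
from collections import defaultdict

# Toy GROUPING SETS: aggregate the same rows at several grains in one pass.
# Dimensions outside a given grouping set collapse to "(all)", playing the
# role of the NULLs SQL emits in GROUPING SETS output.
def grouping_sets(rows, dims, sets, measure):
    out = defaultdict(int)
    for row in rows:
        for keys in sets:
            grain = tuple(row[d] if d in keys else "(all)" for d in dims)
            out[grain] += row[measure]
    return dict(out)
```

A viz tool pointed at this table answers "clicks by country", "by country and device", and "grand total" without ever touching the raw events.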
- Bonus Week 7: ChatGPT-driven data engineering
- Day 1: learn how to reduce your workload by 70-80% by leveraging ChatGPT to write most of your queries and pipelines for you!
- Day 2: party with ChatGPT because now you’re an amazing data engineer! 7/8
The first iteration of this boot camp is full! If you’re interested in joining future iterations please subscribe to my newsletter at zachwilson.tech

#dataengineering


More from @EcZachly

Apr 14
How to use ChatGPT to speed up data pipeline dev in a few easy steps:

1. Supply ChatGPT with your input schemas. Paste your CREATE TABLE statements directly. ChatGPT knows how to parse these and make sense of the fields inside

1/4
2. Specify what type of analytical pattern you want to do on this data. ChatGPT understands aggregation (for GROUP BY queries), cumulation (for FULL OUTER JOIN queries), enrichment (for JOIN queries) and slowly changing dimension patterns.

2/4
3. ChatGPT needs more context on WHICH fields it should apply these patterns to. If you don’t supply that, it’ll give you flaming garbage.

Example: “apply slowly changing dimension type 2 pattern to fields age, gender, and phone os”

3/4
Mar 16
Quick guide to go from 0 to #dataengineering hero:

- learn SQL
DataLemur is a great resource here

- learn Python
Do like… 30-40 LeetCode easy and medium questions

- distributed compute
Get a trial of Databricks or Snowflake and find a training to learn about it

1/3
- data modeling
Find a dimension table like users that you can snapshot daily. Learn about slowly changing dimensions.
Find a fact/event table that you can aggregate and learn about fact modeling

- job orchestration
Learn Mage or Airflow to do your daily automated tasks
2/3
- data storytelling
Take a training by Tableau on data visualization and how to tell stories with data

- communication
Read the Crucial Conversations and Radical Candor books. They’ll help a lot!

If you just do this, you’ll be a lot closer to a great data engineering job! 3/3
Nov 5, 2022
Senior data engineers often own master data sets. These are highly trusted data sets used by many people in the company.

Master data is handled and changed differently than regular data or raw data.

Here’s how master data is different in #dataengineering: 1/5
Changes to master data impact many more decisions. These changes need to be communicated effectively. If they aren’t, people will find discrepancies and lose trust in your data. This causes slower decisions and impacts revenue. 2/5
Master data needs the highest quality possible. What does this look like?
- the pipeline has comprehensive data quality checks that stop data flow when they’re violated

- the pipeline uses write-audit-publish pattern so bad data leaking into production is minimized
3/5
Oct 26, 2022
Ad-hoc SQL queries and SQL queries running in production will generally look different. Copying the data scientist’s query into Airflow isn’t enough to be “production ready”.

Here are some things to look for in ad-hoc queries that should be changed before moving to production. 1/4
1. GROUP BY 1,2,3… / ORDER BY 1,2,3
This is used to speed up writing ad-hoc queries. Please spell the columns out in production.

2. SELECT *
This grabs all the columns quickly for ad-hoc queries. Please spell out the columns in production.
2/4
3. LIMIT 1000
Limits should generally be removed when moving to production since you want the entire data set.

4. Subqueries
Subqueries should almost always be abstracted as CTEs when running in production.
3/4
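A before/after sketch of these cleanups, runnable against a toy sqlite3 table (table and column names are hypothetical; `GROUP BY 1` has the same positional problem as `ORDER BY 1`):

```python
import sqlite3

# Toy table standing in for a production events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, event_type TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "click"), (1, "view"), (2, "click")])

# Ad-hoc style: SELECT *, an inline subquery, positional ORDER BY, a LIMIT
# left over from exploration.
adhoc = """
SELECT *
FROM (SELECT event_type, COUNT(*) AS n_events
      FROM events GROUP BY event_type)
ORDER BY 1
LIMIT 1000
"""

# Production style: subquery lifted into a CTE, columns spelled out,
# ORDER BY named explicitly, no stray LIMIT.
prod = """
WITH event_counts AS (
    SELECT event_type, COUNT(*) AS n_events
    FROM events
    GROUP BY event_type
)
SELECT event_type, n_events
FROM event_counts
ORDER BY event_type
"""

# Same result today — but the production form survives schema changes and
# reads like documentation.
assert conn.execute(adhoc).fetchall() == conn.execute(prod).fetchall()
```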
Oct 25, 2022
Understanding the data value chain helps you be a much more effective data professional.
The steps usually are:

- data generation
- data processing
- data validation
- data analytics
- machine learning predictions

I’ll explain each step in this thread: 1/6
Data generation is owned by software engineers or data engineers.

If the data source is your app or website, SWEs should set up quality logging to get the data flow started.

If the data source is a 3rd-party API, DEs should set up an ingestion pipeline with quality checks. 2/6
Data processing happens by data engineers. This is the world of data modeling and master data. DEs should create robust pipelines here with ample quality checks.
Using patterns like write-audit-publish here can be very powerful to increase quality. 3/6
Oct 24, 2022
Starting out in the data field can be overwhelming. Should you be a data scientist? A data engineer? A data analyst? An ML engineer? The number of role options is overwhelming!

Here's some high-level guidance on how to pick between some of these roles.
1/5
You should become a data analyst if:
You like to investigate business problems. You like digging into the data like Sherlock Holmes and finding patterns that have business impact. You're fascinated by data visualization and building reports. 2/5
You should become a data scientist if:
You really like statistics. You like setting up experiments to see how different experiences impact user behavior. You have a knack for machine learning and can talk about the results from ML algorithms to less technical people. 3/5
