Tweet

More from @EcZachly

Zach Wilson

@EcZachly

May 1

Things I’ve learned running my data engineering boot camp for 2 weeks!

- don’t use platforms like Maven which take 10%. Google classroom is free and good enough. 1/5

- be more than 1 day ahead of the boot camp. When curricula building you should be working on the next weeks curricula this week so your students can review the materials before.

- use Postgres 14+ on your SQL sections otherwise you don’t have access to BIT_COUNT function 2/5

twitter.com/i/web/status/1…

- use 1 long-standing Zoom call instead of 1 Zoom call per session. This makes it easier for students to know where to go and don’t use the Google Calendar + Zoom integration, it’s really bad!

- give people more time on the homework. Strict deadlines are for college not boot… twitter.com/i/web/status/1…

Read 5 tweets

Zach Wilson

@EcZachly

Apr 14

How to use ChatGPT to speed up data pipeline dev in a few easy steps:

1. Supply ChatGPT with your input schemas. Paste your CREATE TABLE statements directly. ChatGPT knows how to parse these and make sense of the fields inside

1/4

2. Specify what type of analytical pattern you want to do on this data. ChatGPT understands aggregation (for GROUP BY queries), cumulation (for FULL OUTER JOIN queries), enrichment (for JOIN queries) and slowly changing dimension patterns.

2/4

3. ChatGPT needs more context on WHICH fields it should apply these patterns to. If you don’t supply that, it’ll give you flaming garbage.

Example: “apply slowly changing dimension type 2 pattern to fields age, gender, and phone os”

3/4

Read 4 tweets

Zach Wilson

@EcZachly

Apr 6

My data eng boot camp curricula has been updated:

- Week 1: Dimension Data Modeling
Day 1: understanding dimensions. Daily dimensions vs SCDs. How to pick SCD type 1/2/3
Day 2: applied dimension data modeling. backfilling SCD tables. Incremental building SCD tables.
1/8

- Week 2: Fact Data Modeling
Day 1: understand facts. denormalized facts vs normalized facts. How to collaborate and get logging right.
Day 2: applied fact data modeling. Reduced facts for efficient long-term analysis. Cumulative table design for efficient fact analysis. 2/8

@ADutchEngineer

- Week 3: Spark
Day 1: understanding when to use Spark. Deep dive into Spark architecture and bottlenecks
Day 2: applied Spark.
understand the parallelism vs network overhead. SparkSQL vs DataFrame vs Dataset.
I’ll lead Scala Spark and @ADutchEngineer will lead PySpark
3/8

Read 8 tweets

Zach Wilson

@EcZachly

Mar 16

Quick guide to go from 0 to #dataengineering hero:

- learn SQL
Data Lemur is a great resource here

- learn Python
Do like… 30-40 leetcode easy and medium questions

- distributed compute
Get a trial of Databricks or Snowflake and find a training to learn about it

1/3

- data modeling
Find a dimension table like users that you can snapshot daily. Learn about slowly changing dimensions.
Find a fact/event table that you can aggregate and learn about fact modeling

- job orchestration
Learn Mage or Airflow to do your daily automated tasks
2/3

- data story telling
Take a training by Tableau on data visualization and how to tell stories with data

- communication
Read crucial conversations and radical candor books. They’ll help a lot!

If you just do this, you’ll be a lot closer to a great data engineering job! 3/3

Read 4 tweets

Zach Wilson

@EcZachly

Nov 5, 2022

Senior data engineers often own master data sets. These are highly trusted data sets used by many people in the company.

Master data is handled and changed differently than regular data or raw data.

Here’s how master data is different in #dataengineering: 1/5

Changes to master data impact many more decisions. These changes need to be communicated effectively. If they aren’t, people will find discrepancies and lose trust in your data. This causes slower decisions and impacts revenue. 2/5

Master data needs the highest quality possible. What does this look like?
- the pipeline has comprehensive data quality checks that stop data flow when they’re violated

- the pipeline uses write-audit-publish pattern so bad data leaking into production is minimized
3/5

Read 5 tweets

Zach Wilson

@EcZachly

Oct 26, 2022

Adhoc SQL queries and SQL queries running in production will generally look different. Copying the data scientist’s query into Airflow isn’t enough to be “production ready”

Here’s some things to look for in ad-hoc queries that should be changed before moving to production. 1/4

1. GROUP BY 1,2,3… / ORDER BY 1,2,3
This is used to speed up writing ad-hoc queries. Please spell them out in production.

2. SELECT *
This grabs all the columns quickly for ad-hoc queries. Please spell out * in production.
2/4

3. LIMIT 1000
Limits should generally be removed when moving to production since you want the entire data set.

4. Sub queries
Sub queries should almost always be abstracted as CTEs when running in production.
3/4

Read 4 tweets

Share this page!

Enter Twitter Thread URL to Unroll

Zach Wilson

People who liked this thread also liked...

Try unrolling a thread yourself!

More from @EcZachly

Zach Wilson

Zach Wilson

Zach Wilson

Zach Wilson

Zach Wilson

Zach Wilson

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!