I write about data engineering | SQL | Python | Distributed systems. Get my free data engineering course at https://t.co/sZTEcV0Q9W
Dec 10, 2024 • 11 tweets • 2 min read
I've been doing data engineering coding interviews for 10 years.
I'll teach you the most common patterns in 5 minutes.
#datastructures #interview
1. Data structures
Know how to use standard data structures in your language.
In Python, these include:
* dictionary, list, set, queue, and the collections module
Know the read-and-write time complexities for these data structures.
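A minimal sketch of the structures above, using only the Python standard library. The variable names and sample values are made up for illustration, and the complexities in the comments are the usual average-case figures for CPython.

```python
from collections import Counter, defaultdict, deque

# dict: average O(1) insert and lookup by key
counts = {}
for word in ["sql", "python", "sql"]:
    counts[word] = counts.get(word, 0) + 1

# list: amortized O(1) append, O(1) index access, O(n) membership test
events = [3, 1, 2]
events.append(4)

# set: average O(1) insert and membership test
seen = set(events)
has_three = 3 in seen

# deque used as a FIFO queue: O(1) append and popleft
queue = deque(events)
first = queue.popleft()

# collections helpers: Counter for tallies, defaultdict for grouping
word_counts = Counter(["sql", "python", "sql"])
groups = defaultdict(list)
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    groups[key].append(value)
```

Note that popping from the front of a plain list is O(n), which is why a deque is the usual choice for a queue.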
Aug 20, 2024 • 8 tweets • 2 min read
🎯 Mastering Data Pipeline Stages: The Key to Reliable Data Systems 🎯
Transitioning to a DE role can feel like stepping into a whole new world. If you're navigating terms like "landing," "raw," "cleaned," "curated," and "analytical," you're not alone.
Let me break it down 👇
1. Raw Data: This is your starting point—data pulled directly from the source, untouched and in its original format.
Apr 15, 2024 • 5 tweets • 1 min read
If you are learning about OLAP databases, here are a few free, open-source ones. 👇
1. DuckDB: Simple & lightweight. Best for practicing SQL skills.
#data #dataengineering #SQL #database
2. Trino: Can pull data from multiple underlying DBs. Data loading, query plans, and ops resemble real-life warehouse systems.
Mar 18, 2024 • 7 tweets • 1 min read
Most data engineers realize LLMs can't replace good data engineers. Deep knowledge of the data stack will be in high demand! LLMs will create more opportunities. Here's why: 🧵
#data #dataengineering #database #LLM
1. They are great at writing code quickly, provided you specify the exact details and fine-tune the produced code to ensure high quality.
Nov 16, 2023 • 4 tweets • 1 min read
Technical skills for data engineers can be broken down into these core components 🧵
1. Data storage: Distributed data storage, partitioning, clustering, column encoding, & table formats.
2. Data processing: Data shuffling, in-memory processing, and query planning.
3. Data modeling: Dimensional model, data vault.
May 23, 2023 • 7 tweets • 2 min read
Testing your pipelines before merging is crucial to ensure they do not fail in production. However, testing data pipelines is complex (and expensive) due to the data size, confidentiality, and time it takes to test a data pipeline.
🧵 #data #dataengineering #testing #dataops
Here are a few ways to get data for your tests:
1. Copying data: Testing against an exact copy of the prod data ensures that our changes do not break the pipeline. This is expensive! You can test on a subset of the data instead, accepting that you may miss some edge cases.
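One way to take a subset is deterministic, hash-based sampling, so the same keys land in the test data on every run. This is a sketch under assumptions: `prod_rows`, `order_id`, and the 10 percent rate are hypothetical stand-ins, not part of any real pipeline.

```python
import hashlib


def in_sample(key: str, percent: int) -> bool:
    """Return True if the key hashes into the first `percent` of 100 buckets."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def sample_rows(rows, key_field, percent):
    """Keep only the rows whose key falls inside the sampled buckets."""
    return [row for row in rows if in_sample(str(row[key_field]), percent)]


# Hypothetical stand-in for production data
prod_rows = [{"order_id": i, "amount": i * 10} for i in range(1000)]
test_rows = sample_rows(prod_rows, "order_id", 10)
```

Sampling on a join key (rather than picking random rows per table) keeps related rows together across tables, at the cost of missing edge cases that fall outside the sample.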
May 22, 2023 • 4 tweets • 4 min read
Data engineers work with multiple systems & it's crucial to understand DevOps. Shown below are a few DevOps concepts to familiarize oneself with:
If you have worked in the data space, you would have heard the term Metadata. It is used as a catch-all term. Here are a few things to think about when someone mentions Metadata 👇
#data #dataengineering #metadata #dataops
1. Orchestration: Time of run, re-run information, pipeline structure, the execution time for the pipeline, pipeline failure times, etc.
2. Data processing: Input parameters, failure stack trace, number of rows processed, number of rows in output, number of discarded rows, etc.
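As a rough sketch of capturing the data processing metadata listed above, a step can return a small record alongside its output. `RunMetadata`, `run_step`, and the `amount` field are hypothetical names chosen for this example, and the "discard rows with a missing amount" rule is an assumption.

```python
import time
import traceback
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunMetadata:
    input_params: dict
    rows_processed: int = 0
    rows_output: int = 0
    rows_discarded: int = 0
    duration_seconds: float = 0.0
    failure_stack_trace: Optional[str] = None


def run_step(rows, input_params):
    """Process rows and return (output, metadata about the run)."""
    meta = RunMetadata(input_params=input_params)
    output = []
    start = time.monotonic()
    try:
        for row in rows:
            meta.rows_processed += 1
            if row.get("amount") is None:
                # Assumed rule: rows without an amount are discarded
                meta.rows_discarded += 1
                continue
            output.append({**row, "amount": float(row["amount"])})
        meta.rows_output = len(output)
    except Exception:
        # Keep the stack trace so the failure can be inspected later
        meta.failure_stack_trace = traceback.format_exc()
    finally:
        meta.duration_seconds = time.monotonic() - start
    return output, meta


output, meta = run_step(
    [{"amount": "1.5"}, {"amount": None}, {"amount": 2}],
    {"run_date": "2023-05-22"},
)
```

In practice the record would be written to a metadata table or log rather than returned, but the fields stay the same.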