Latest Twitter Threads by @startdataeng on Thread Reader App

Dec 10, 2024 • 11 tweets • 2 min read

I've been doing data engineering coding interviews for 10 years.

I'll teach you the most common patterns in 5 minutes.

#datastructures #interview

1. Data structures

Know how to use standard data structures in your language.

In Python, these include:

* dictionary, list, set, queue, and the collections module

Know the read and write time complexities for these data structures.
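
As a rough illustration of the structures listed above, here is a minimal Python sketch with typical average-case costs noted in the comments. The table names, task names, and sample values are made up for illustration.

```python
from collections import Counter, defaultdict, deque

# dict: average O(1) lookup, insert, and delete by key
row_counts = {"orders": 120, "users": 45}
row_counts["payments"] = 80

# list: O(1) append and index access; O(n) membership test and front insert
events = ["load", "clean"]
events.append("publish")

# set: average O(1) membership test and insert
seen_ids = {1, 2, 3}
is_duplicate = 2 in seen_ids

# deque used as a queue: O(1) append and pop at both ends
queue = deque(["task_a", "task_b"])
queue.append("task_c")
first_task = queue.popleft()

# collections.Counter: counts hashable items in a single O(n) pass
status_counts = Counter(["ok", "fail", "ok"])

# collections.defaultdict: groups values without checking for missing keys
by_source = defaultdict(list)
for source, table in [("crm", "users"), ("erp", "orders"), ("crm", "leads")]:
    by_source[source].append(table)
```

The deque is worth noting because popping from the front of a plain list is O(n), which is a common follow-up question.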

Aug 20, 2024 • 8 tweets • 2 min read

🎯 Mastering Data Pipeline Stages: The Key to Reliable Data Systems 🎯

Transitioning to a DE role can feel like stepping into a whole new world. If you're navigating terms like "landing," "raw," "cleaned," "curated," and "analytical," you're not alone.

Let me break it down 👇

1. Raw Data: This is your starting point: data pulled directly from the source, untouched and in its original format.

Apr 15, 2024 • 5 tweets • 1 min read

If you are learning about OLAP databases, here are a few free, open-source ones. 👇

1. DuckDB: Simple & lightweight. Best for practicing SQL skills.

#data #dataengineering #SQL #database

2. Trino: Can pull data from multiple underlying DBs. Data loading, query plans, and ops resemble real-life warehouse systems.

Mar 18, 2024 • 7 tweets • 1 min read

Most data engineers realize LLMs can't replace good data engineers. Deep knowledge of the data stack will be in high demand! LLMs will create more opportunities. Here's why: 🧵

#data #dataengineering #database #LLM

1. They are great at writing code quickly, provided you specify the exact details and fine-tune the produced code to ensure high quality.

Nov 16, 2023 • 4 tweets • 1 min read

Technical skills for data engineers can be broken down into these core components 🧵

1. Data storage: Distributed data storage, partitioning, clustering, column encoding, & table formats.

2. Data processing: Data shuffling, in-memory processing, and query planner.

3. Data modeling: Dimensional model, data vault.

May 23, 2023 • 7 tweets • 2 min read

Testing your pipelines before merging is crucial to ensure they do not fail in production. However, testing data pipelines is complex (and expensive) due to the data size, confidentiality, and time it takes to test a data pipeline.
🧵

#data #dataengineering #testing #dataops

Here are a few ways to get data for your tests:

1. Copying data: An exact copy of the prod data for testing will ensure that our changes are not breaking the pipeline. This is expensive! You can test on a subset of the data instead, accepting that some edge cases may be missed.
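
One possible way to take a repeatable subset of prod data is to hash a key column and keep only the rows that land in a fixed bucket. The sketch below assumes rows are plain dictionaries; the helper names (in_sample, sample_rows) and the order_id field are hypothetical, not from the thread.

```python
import hashlib


def in_sample(key: str, percent: int) -> bool:
    """Return True if the key hashes into the first `percent` of 100 buckets."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def sample_rows(rows, key_field, percent):
    """Keep roughly `percent`% of rows, chosen deterministically by key."""
    return [row for row in rows if in_sample(str(row[key_field]), percent)]


# Made-up stand-in for a prod table
rows = [{"order_id": i, "amount": i * 10} for i in range(1000)]
subset = sample_rows(rows, "order_id", 10)
```

Because the choice depends only on the key, the same rows are selected on every run, which keeps test results comparable between branches. Rare edge cases can still fall outside the sample.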

May 22, 2023 • 4 tweets • 4 min read

Data engineers work with multiple systems & it's crucial to understand DevOps. Shown below are a few DevOps concepts to familiarize oneself with:

1. Docker: docs.docker.com/get-started/
2. Kubernetes: kubernetes.io/docs/concepts/…
3. CI/CD: resources.github.com/ci-cd/

#dataengineering #data

4. IaC: pulumi.com/what-is/what-i…
5. Monitoring: atlassian.com/devops/devops-…
6. Access control: techtarget.com/searchsecurity…
7. Key management: aws.amazon.com/kms/?c=sc&sec=…
8. Encrypted connections: dev.mysql.com/doc/refman/8.0…

Jan 9, 2023 • 6 tweets • 3 min read

If you have worked in the data space, you would have heard the term Metadata. It is used as a catch-all term. Here are a few things to think about when someone mentions Metadata 👇

#data #dataengineering #metadata #dataops

1. Orchestration: Time of run, re-run information, pipeline structure, the execution time for the pipeline, pipeline failure times, etc.

2. Data processing: Input parameters, failure stack trace, number of rows processed, number of rows in output, number of discarded rows, etc
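
To make the data processing metadata above concrete, here is a minimal sketch of a step that records it while filtering rows. The RunMetadata class, the process function, and the min_amount parameter are hypothetical names chosen for illustration.

```python
import traceback
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunMetadata:
    input_params: dict
    rows_processed: int = 0
    rows_output: int = 0
    rows_discarded: int = 0
    failure_stack_trace: Optional[str] = None


def process(rows, min_amount):
    """Filter rows by amount and capture metadata about the run."""
    meta = RunMetadata(input_params={"min_amount": min_amount})
    output = []
    try:
        for row in rows:
            meta.rows_processed += 1
            if row["amount"] >= min_amount:
                output.append(row)
            else:
                meta.rows_discarded += 1
        meta.rows_output = len(output)
    except Exception:
        # Keep the stack trace as metadata instead of losing it
        meta.failure_stack_trace = traceback.format_exc()
    return output, meta


output, meta = process(
    [{"amount": 5}, {"amount": 50}, {"amount": 500}], min_amount=10
)
```

In a real pipeline this record would usually be written to a log or a metadata table so that row counts can be compared across runs.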
