Learning data engineering can be tough, but it doesn't have to be!
Sign up for our free "Data Engineering 101" & learn about DE core concepts, best practices, data modeling, building scalable & resilient systems, and real-world projects.
Testing your pipelines before merging is crucial to ensure they don't fail in production. However, testing data pipelines is complex (and expensive) due to data size, confidentiality constraints, and the time tests take to run.
🧵 #data #dataengineering #testing #dataops
Here are a few ways to get data for your tests:
1. Copying data: An exact copy of the prod data ensures your changes don't break the pipeline, but it's expensive! Alternatively, you can test on a sample of the data, accepting that you may miss some edge cases.
2. Data git: Projects like Nessie and lakeFS provide git-like branching over your data, letting you set up isolated test environments without replicating the entire dataset.
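To make option 1 concrete, here is a minimal sketch of deterministic sampling: take a reproducible slice of production rows for your test environment. The `sample_rows` helper, the seed, and the row shapes are illustrative assumptions, not part of any specific tool.

```python
import random

def sample_rows(rows, fraction=0.1, seed=42):
    """Deterministically sample a fraction of production rows for tests.

    A fixed seed keeps the test dataset stable across runs, so a failing
    test points to a pipeline change, not to sampling noise.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

# Hypothetical usage: keep ~10% of "prod" rows for testing.
prod_rows = [{"id": i, "amount": i * 10} for i in range(1000)]
test_rows = sample_rows(prod_rows, fraction=0.1)
```

Because the sample is deterministic, every developer and CI run tests against the same slice; the trade-off is that rare edge-case rows may never land in the sample.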
If you have worked in the data space, you have probably heard the term Metadata used as a catch-all. Here are a few things to think about when someone mentions Metadata 👇