📣Data Engineering Projects for Beginners 2022

👇🧵[1/x]

#dataengineering #python #Docker #developers #aws #GoogleCloud #apacheairflow
Tracking your Uber Rides and Uber Eats expenses through a data engineering process

Technologies and skills:
Python, Docker, Apache Airflow, AWS Redshift, Power BI, data modelling, task scheduling, ETL and ELT processes, data warehousing, Cloud

🧵[2/x]

github.com/Wittline/uber-…
Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

Technologies and skills:
Python, Docker, Big Data, Cloud, Google Cloud, Redis, DAG, Parallel Processing, Apache Spark

🧵[3/x]

github.com/Wittline/pyDag
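
To give a feel for what DAG-based scheduling means, here is a minimal sketch using Python's standard-library graphlib. It is not pyDag's API: the task names and the execution_batches helper are hypothetical, made up only for illustration. Each batch holds tasks whose dependencies are already done, so the tasks inside a batch could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"clean"},
    "publish": {"aggregate", "report"},
}


def execution_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks in the same batch have no pending dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents become ready
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(execution_batches(PIPELINE), start=1):
        print(step, batch)
```

A real scheduler would hand each batch to workers instead of printing it, but the ordering logic stays the same.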
Building Big Data Pipelines in the Cloud with AWS EMR

Technologies and skills:
Python, PySpark, AWS EMR, Task Scheduling, IaC, EC2 Instances, Apache Spark, Cloud

🧵[4/x]

github.com/Wittline/pyspa…
Building a Lossless Data Compression and Data Decompression Pipeline

Technologies and skills:
Python, Data compression, BZIP2, Parallel programming

🧵[5/x]

github.com/Wittline/wbz
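
A rough sketch of the idea behind a parallel, lossless compression pipeline, using only Python's built-in bz2 module. This is not how the wbz project is implemented: the chunk size and both function names are assumptions chosen for illustration. The data is split into chunks, each chunk is compressed independently, and decompression restores the exact original bytes.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, picked only for this example


def compress_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks and compress each chunk on its own."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # pool.map returns results in input order, so block order matches chunk order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(bz2.compress, chunks))


def decompress_chunks(blocks: list[bytes]) -> bytes:
    """Decompress every block and join them back in their original order."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(bz2.decompress, blocks))


if __name__ == "__main__":
    original = b"data engineering " * 50_000
    blocks = compress_chunks(original)
    restored = decompress_chunks(blocks)
    print(len(original), sum(len(b) for b in blocks), restored == original)
```

Threads keep the example short; a process pool is a reasonable alternative for CPU-heavy work.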
Learn how to dockerize an Apache Spark Standalone Cluster

Technologies and skills:
Python, Jupyter Notebook, Apache Spark, Docker, docker-compose, Hive

🧵[6/x]

github.com/Wittline/apach…
Dockerizing and Consuming an Apache Livy environment

Technologies and skills:
Python, Big Data, Docker, docker-compose, Apache Livy, Apache Spark, PostgreSQL, PySpark, Jupyter Notebook

🧵[7/x]

github.com/Wittline/docke…
Design, Development and Deployment of a simple Data Pipeline

Technologies and skills:
Python, Data Modelling, Docker, docker-compose, PostgreSQL, data pipeline, FastAPI

🧵[8/x]

github.com/Wittline/data-…
Dockerizing a Python Script for Faster Web Scraping

Technologies and skills:
Python, Docker, SQLite, Dockerfile, Web scraping, Data pipeline, FastAPI

🧵[9/x]

github.com/Wittline/data-…
Understanding Similarity Measures for Text Analysis

Technologies and skills:
Python, Machine Learning, Similarity measures, Distance metrics, Text Analysis

🧵[10/x]

github.com/Wittline/dista…
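
As a quick taste of what similarity measures look like in code, here is a small sketch of two common ones, cosine similarity and Jaccard similarity, on whitespace-tokenized text. The function names and the naive tokenization are assumptions for illustration and are not taken from the linked repo.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0  # empty text: define similarity as 0


def jaccard_similarity(a: str, b: str) -> float:
    """Shared unique words divided by all unique words across both texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


if __name__ == "__main__":
    x = "spark runs big data pipelines"
    y = "airflow schedules data pipelines"
    print(cosine_similarity(x, y), jaccard_similarity(x, y))
```

Cosine similarity takes word frequency into account, while Jaccard only looks at which words are present.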
Learn how to build a content-based Movie Recommender System

Technologies and skills:
Python, Machine Learning, TF-IDF, Cosine similarity, BM25, BERT, NLP, word2vec, Text Analysis, recsys

🧵[11/x]

github.com/Wittline/recom…
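
To illustrate the TF-IDF plus cosine similarity approach to content-based recommendations, here is a pure-Python sketch. The movie catalogue is invented toy data, the helper names are hypothetical, and the IDF formula (log(N/df) + 1) is just one common variant, so treat this as an illustration of the technique and not as the repo's implementation.

```python
import math
from collections import Counter

# Hypothetical toy catalogue: title -> short description.
MOVIES = {
    "Space Wars": "space battle rebels empire",
    "Galaxy Quest": "space crew aliens comedy",
    "Love Letters": "romance letters paris",
}


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build a sparse TF-IDF vector (term -> weight) for every document."""
    tokens = {title: text.lower().split() for title, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for title, words in tokens.items():
        counts = Counter(words)
        vectors[title] = {
            term: (count / len(words)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def recommend(title: str, docs: dict[str, str], top_n: int = 2) -> list[str]:
    """Rank the other titles by similarity of their descriptions to the given title."""
    vectors = tfidf_vectors(docs)
    scores = {
        other: cosine(vectors[title], vec)
        for other, vec in vectors.items()
        if other != title
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    print(recommend("Space Wars", MOVIES, top_n=1))
```

Swapping the TF-IDF vectors for embeddings (word2vec, BERT) keeps the same ranking step and only changes how each movie is represented.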
A Text Analysis of Speeches

Technologies and skills:
Python, Machine Learning, NLP, word2vec, Text Analysis, Sentiment Analysis, PCA, t-SNE, Word Embeddings, Text Preprocessing, Web scraping, Data Visualization

🧵[12/x]

github.com/Wittline/text-…
Dropout Students Prediction

Technologies and skills:
R, Genetic algorithm, Neural Networks, K-Means, Clustering, Machine Learning

🧵[13/x]

github.com/Wittline/Dropo…
I tweet about all things data-related. Follow me for more content.

@thecodemancer_

🧵[15/x]

linkedin.com/in/davidregala…
