Job postings for entry-level data scientists are nonsense.
Don't try to meet all their requirements.
This is what you need to do instead 👇👇👇
Do not try to tick all the boxes in these long job postings.
Because you will go crazy.
And because it is a lie that you need to rock at Python, SQL, ETL design, data visualization, Deep Learning, and Metaphysics to land an entry-level job in data science.
So why are companies asking for all these things?
Well, because most of them do not have a clue about data science, so they copy-paste the job descriptions they see at top tech companies.
Fear of missing out (FOMO) pushes normal companies to ask for things they do not even need.
Here are 2 steps that every real-world ML problem has...
... that you won't learn on Kaggle 👇👇👇
➡️ From business problem to ML problem
Every Kaggle competition starts with a clearly defined target metric you need to optimize for.
But, in real-world ML, there is no target metric waiting for you.
It is your job to translate a business problem into an ML problem by finding the right proxy metric.
This proxy metric is a quantitative, abstract metric that correlates positively with the actual business metric you want to impact, e.g. accuracy, precision, or recall.
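For example, if the business problem is reducing customer churn, precision on churn predictions can work as the proxy metric: it tracks how much of your retention budget is spent on customers who were actually about to leave. A minimal sketch of computing it with scikit-learn (the labels and predictions below are hypothetical):

```python
# Minimal sketch: precision as a proxy metric for a churn-prevention problem.
# y_true and y_pred are hypothetical; in practice y_true comes from historical
# labels and y_pred from your model's predictions.
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = customer actually churned
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # 1 = model flagged customer as a churn risk

# Precision = "of the customers we flag (and spend retention budget on),
# how many were really about to churn?"
print(f"Precision (proxy metric): {precision_score(y_true, y_pred):.2f}")
```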
All ML systems can be decomposed into 3 pipelines (aka programs):
- Feature pipeline
- Training pipeline
- Inference pipeline
And this is how they work 👇
The feature pipeline takes raw data from
- a data warehouse,
- an external API, or
- a website, through scraping,
and generates features, aka the inputs for your ML model, and stores them in a Feature Store so that the other 2 pipelines can later use these features.
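Here is a minimal sketch of a feature pipeline, assuming the raw data is a CSV of transactions and using a Parquet file as a stand-in for a real Feature Store (paths and column names are hypothetical):

```python
# Minimal feature-pipeline sketch (hypothetical source and columns).
import pandas as pd

RAW_DATA_PATH = "raw/transactions.csv"                        # hypothetical raw data source
FEATURE_STORE_PATH = "feature_store/user_features.parquet"    # stand-in for a real Feature Store

def run_feature_pipeline() -> None:
    raw = pd.read_csv(RAW_DATA_PATH, parse_dates=["timestamp"])

    # Generate features: aggregate raw transactions per user.
    features = (
        raw.groupby("user_id")
           .agg(
               total_spend=("amount", "sum"),
               n_transactions=("amount", "count"),
               last_seen=("timestamp", "max"),
           )
           .reset_index()
    )

    # "Store" the features so the training and inference pipelines can reuse them.
    features.to_parquet(FEATURE_STORE_PATH, index=False)

if __name__ == "__main__":
    run_feature_pipeline()
```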
The training pipeline takes the features from the store and outputs a trained ML model.
These are (in general) the best models for each domain:
- Tabular data → XGBoost
- Computer Vision → Fine-tune a Convolutional Neural Net
- NLP → Fine-tune a Transformer net.
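For the tabular case, a minimal training-pipeline sketch could look like this, reading the features produced by the feature pipeline above and assuming a "churned" label column exists (all paths and column names are hypothetical):

```python
# Minimal training-pipeline sketch for tabular data with XGBoost.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

FEATURE_STORE_PATH = "feature_store/user_features.parquet"   # written by the feature pipeline
MODEL_PATH = "churn_model.json"                               # hypothetical output path

# Load features from the (stand-in) Feature Store.
features = pd.read_parquet(FEATURE_STORE_PATH)
X = features[["total_spend", "n_transactions"]]
y = features["churned"]   # assumes a label column was added upstream

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost classifier on the tabular features.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Evaluate on the proxy metric chosen earlier.
print("Precision:", precision_score(y_test, model.predict(X_test)))

# Persist the trained model so the inference pipeline can load it.
model.save_model(MODEL_PATH)
```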