Answer these questions
❓ What's your team's ML expertise?
❓ How much control/abstraction do you need?
❓ Would you like to handle the infrastructure components?
🧵 👇
@SRobTweets created this pyramid to explain the idea.
As you move up the pyramid, less ML expertise is required, and you also don’t need to worry as much about the infrastructure behind your model.
@SRobTweets If you're using open-source ML frameworks (#TensorFlow) to build your models, you get the flexibility to move your workloads across different development & deployment environments. But you need to manage all the infrastructure yourself for training & serving
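To make that tradeoff concrete, here's a minimal sketch of the bottom of the pyramid, assuming TensorFlow/Keras (the feature count and layer sizes are made up for illustration). You define, train, and serve this model yourself; none of it is managed for you.

```python
# Minimal sketch: the "full control" end of the pyramid, using TensorFlow/Keras.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                        # 10 input features (hypothetical)
    tf.keras.layers.Dense(64, activation="relu"),       # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),     # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=10)  # training data, compute & serving are all on you
```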
⚖️ How to deal with imbalanced datasets?⚖️
Most real-world datasets are not perfectly balanced. If 90% of your dataset belongs to one class, & only 10% to the other, how can you prevent your model from predicting the majority class 90% of the time?
🧵 👇
🐱🐱🐱🐱🐱🐱🐱🐱🐱🐶 (90:10)
💳 💳 💳 💳 💳 💳 💳 💳 💳 ⚠️ (90:10)
There can be many reasons for imbalanced data. The first step is to see if it's possible to collect more data. If you're already working with all the data that's available, these 👇 techniques can help
Here are 3 techniques for addressing data imbalance. You can use just one of them or combine all three (code sketch below 👇):
⚖️ Downsampling
⚖️ Upsampling
⚖️ Weighted classes
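A minimal sketch of all three techniques, assuming a pandas DataFrame `df` with a binary "label" column where 1 is the rare class (the column name is made up for illustration):

```python
# Minimal sketch: three ways to handle a 90:10 class imbalance.
import pandas as pd
from sklearn.utils import resample

majority = df[df["label"] == 0]   # ~90% of rows
minority = df[df["label"] == 1]   # ~10% of rows

# Downsampling: drop majority-class rows until the classes match.
downsampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=42),
    minority,
])

# Upsampling: duplicate minority-class rows until the classes match.
upsampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=42),
])

# Weighted classes: keep the data as-is, but make minority-class errors cost more.
# Keras, for example, accepts this directly: model.fit(..., class_weight=class_weight)
class_weight = {0: 1.0, 1: len(majority) / len(minority)}  # ~9x weight for the rare class
```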
Let's explore #MachineLearning terms for supervised learning:
🔸Labels - the thing we're predicting
🔸Features - the input variables
🔸Examples - particular instances of data (labeled/unlabeled)
🔸Models - define the relationship between features & labels
A 🧵
🔸Labels - the thing we're predicting
E.g. the y variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.
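A tiny sketch of what that looks like in code, using a hypothetical wheat-price table (all column names invented for illustration):

```python
# Minimal sketch: the label is the column we want to predict; everything else is a feature.
import pandas as pd

data = pd.DataFrame({
    "rainfall_mm":   [320, 410, 290],          # feature
    "fertilizer_kg": [50, 65, 40],             # feature
    "price_usd":     [210.0, 245.0, 198.0],    # label (the y variable)
})

X = data[["rainfall_mm", "fertilizer_kg"]]  # features
y = data["price_usd"]                       # label
```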
📌Quantity & quality of your data dictate how accurate your model is
📌The outcome of this step is usually a table with some values (features)
📌 If you want to use pre-collected data, get it from sources such as Kaggle or BigQuery Public Datasets
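For example, here's a minimal sketch of pulling a BigQuery public dataset into a DataFrame, assuming the google-cloud-bigquery client is installed and your credentials are already configured:

```python
# Minimal sketch: querying a BigQuery public dataset into pandas.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT year, COUNT(*) AS births
    FROM `bigquery-public-data.samples.natality`
    GROUP BY year
    ORDER BY year
"""
df = client.query(query).to_dataframe()  # needs the pandas/db-dtypes extras installed
print(df.head())
```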