⚡️ Unsupervised learning ⚡️
Input data is unlabeled & the model learns to recognize the inherent patterns in it
Eg: Data on a few people's eating habits
🔸 Model input = 🍅🥦🧄🍄🥑🥕🍎
🔸 Model output = clusters such as vegetarian/vegan
A 🧵
2/8 When is unsupervised learning used?
🔸 On large datasets where annotating (labeling) data is costly
🔸 When we don't know how many classes might exist in the data
🔸 To cluster the data first, then apply classification to the individual clusters
3/8
Common types of unsupervised learning:
🔹 Clustering - divide the data by similarity
👉 Eg: Target marketing, customer recommendations
🔹 Dimensionality reduction - compress the features while preserving the data's broader structure
👉 Eg: Big data visualizations, structure discovery
5/8
👉 Centroid-based clustering organizes the data into non-hierarchical clusters
🔹 These algorithms are efficient but sensitive to initial conditions and outliers
🔹 k-means is the most widely used centroid-based algorithm: efficient, effective, & simple (sketch below)
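Here's a minimal k-means sketch; the thread doesn't prescribe a library, so scikit-learn and the synthetic blob data (with `n_clusters=3`) are illustrative assumptions:

```python
# Hypothetical example: k-means on synthetic data via scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 3 dense blobs of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 re-runs k-means from different random centroids,
# which helps with the sensitivity to initial conditions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one learned centroid per cluster
```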
6/8
👉 Density-based clustering connects areas of high example density into clusters
🔹 Allows for arbitrary-shaped distributions as long as dense areas can be connected
🔹 These algorithms have difficulty with data of varying densities & high dimensions (sketch below)
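A density-based sketch using DBSCAN from scikit-learn on two interleaved half-moons; the `eps` and `min_samples` values are illustrative:

```python
# Hypothetical example: DBSCAN finds arbitrary-shaped dense clusters.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: clusters k-means would struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighborhood radius; min_samples = points required for a dense region
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Points labeled -1 were not connected to any dense area (noise/outliers)
print(set(labels))
```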
7/8
👉 Distribution-based clustering assumes the data is composed of distributions
🔹 such as Gaussian distributions
🔹 As distance from the distribution's center increases, the probability that a point belongs to the distribution decreases (sketch below)
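A distribution-based sketch with a Gaussian mixture model (scikit-learn used here purely for illustration):

```python
# Hypothetical example: fit a mixture of 3 Gaussians to toy data.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# Soft assignments: P(point belongs to each Gaussian); this probability
# drops as a point gets farther from a component's center
probs = gmm.predict_proba(X)
print(probs[:3].round(3))
```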
8/8
👉 Hierarchical clustering
🔹 Creates a tree of clusters
🔹 Well suited to hierarchical data, such as taxonomies
🔹 Advantage: any number of clusters can be chosen by cutting the tree at the right level (sketch below)
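A hierarchical sketch with SciPy: build the tree once, then cut it at whatever level gives the number of clusters you want (the data and linkage method are illustrative assumptions):

```python
# Hypothetical example: agglomerative clustering + tree cutting.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Ward linkage builds the full cluster tree (dendrogram)
Z = linkage(X, method="ward")

# Cut the SAME tree at different levels to get 2, 3, or 4 clusters
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters ->", sorted(set(labels)))
```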
Answer these questions
❓ What's your team's ML expertise?
❓ How much control/abstraction do you need?
❓ Do you want to handle the infrastructure components yourself?
🧵 👇
@SRobTweets created this pyramid to explain the idea.
As you move up the pyramid, less ML expertise is required, and you also don't need to worry as much about the infrastructure behind your model.
If you're using open-source ML frameworks (#TensorFlow) to build the models, you get the flexibility of moving your workloads across different development & deployment environments. But you need to manage all the infrastructure yourself for training & serving.
⚖️ How to deal with imbalanced datasets? ⚖️
Most real-world datasets are not perfectly balanced. If 90% of your dataset belongs to one class, & only 10% to the other, how can you prevent your model from predicting the majority class 90% of the time?
🧵 👇
🐱🐱🐱🐱🐱🐱🐱🐱🐱🐶 (90:10)
💳 💳 💳 💳 💳 💳 💳 💳 💳 ⚠️ (90:10)
There can be many reasons for imbalanced data. The first step is to see if it's possible to collect more data. If you're working with all the data that's available, these 👇 techniques can help
Here are 3 techniques for addressing data imbalance. You can use just one of these or all of them together (sketch after the list):
✔️ Downsampling
✔️ Upsampling
✔️ Weighted classes
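A sketch of all three on a 90:10 toy dataset, using pandas for the resampling and a class-weight dict of the kind Keras' `model.fit(..., class_weight=...)` accepts; the numbers are illustrative:

```python
# Hypothetical example: downsampling, upsampling, and class weights.
import pandas as pd

df = pd.DataFrame({"label": [0] * 900 + [1] * 100})  # 90:10 imbalance
majority, minority = df[df.label == 0], df[df.label == 1]

# 1) Downsampling: throw away majority rows to match the minority
down = pd.concat([majority.sample(len(minority), random_state=42), minority])

# 2) Upsampling: repeat minority rows (sampling with replacement)
up = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=42)])

# 3) Weighted classes: make each minority example count ~9x in the loss
class_weight = {0: 1.0, 1: len(majority) / len(minority)}

print(down.label.value_counts().to_dict())  # both classes now 100 rows
print(up.label.value_counts().to_dict())    # both classes now 900 rows
print(class_weight)                         # {0: 1.0, 1: 9.0}
```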
Since it's Day 10 of #31DaysofML, it's the perfect time to discuss 1️⃣0️⃣ things that can go wrong with #MachineLearning projects and what you can do about it!
I watched this amazing presentation by @kweinmeister that sums it all up
A 🧵
1️⃣ You aren't solving the right problem
❓ What's the goal of your ML model?
❓ How do you assess if your model is "good" or "bad"?
❓ What's your baseline?
👉 Focus on a long-term mission with maximum impact
👉 Ensure that your problem is a good fit for ML
2️⃣ Jumping into development without a prototype
👉 An ML project is an iterative process
👉 Start with a simple model & keep refining it until you've reached your goal
👉 A quick prototype can tell you a lot about hidden requirements, implementation challenges, scope, etc.
🙋‍♀️ I thought today I would share a tip that has helped me in my #MachineLearning journey
💡 The best way to learn ML is to pick a problem you feel excited about & let it guide your learning path. Don't worry about the terms or tools; it's all secondary
Here's an example. A few weeks ago I wanted to live-translate an episode of @GCPPodcast. The first question I asked myself was:
🤔 Does any video/audio translation API already exist?
🔹 If so, I would give that a try
🔹 If not, I would create it from scratch
Next, I started digging into the Media Translation API, which translates audio & video data.
My point is:
👉 You don't always need to create a model
👉 Save yourself time & resources by using the models that already exist (if they serve your purpose)
⬇️ Reducing Loss ⬇️
An iterative process of choosing model parameters that minimize loss
👉 The loss function is how we compute loss
👉 The loss curve is convex for linear regression
A 🧵 👇
Calculating the loss for every value of w isn't efficient; the most common approach is called gradient descent
👉 Start with any value of w, b (weights & biases)
👉 Keep stepping until the overall loss stops changing, or changes only very slowly
👉 That point is called convergence
As you probably already guessed, a gradient is a vector with:
👉 Direction
👉 Magnitude
Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (or step size) to determine the next point.
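A minimal gradient-descent sketch for linear regression in NumPy, following the recipe above; the toy data, learning rate, & step count are illustrative assumptions:

```python
# Hypothetical example: gradient descent on MSE loss for y = w*x + b.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 2 + rng.normal(0, 0.1, 100)  # ground truth: w=3, b=2

w, b = 0.0, 0.0       # start with any value of w, b
learning_rate = 0.1   # the scalar step size

for step in range(500):
    y_pred = w * x + b
    grad_w = np.mean(2 * (y_pred - y) * x)  # d(MSE)/dw
    grad_b = np.mean(2 * (y_pred - y))      # d(MSE)/db
    # Move against the gradient, scaled by the learning rate
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # should land near 3 and 2
```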