Let's break it down! The basis is this simple formula describing an iterative optimization method.
We have some weights (parameters), and we iteratively update them step by step to reach our goal.
Iterative methods are used when we cannot compute the solution directly.
Gradient Descent Update 👇
We define a loss function describing how good our model is. We want to find the weights that minimize the loss (make the model better).
We compute the gradient of the loss and update the weights by a small amount (the learning rate) in the direction opposite to the gradient.
Here is an illustration of how it works.
The gradient tells us if the loss will decrease (negative gradient) or increase (positive gradient) if we increase the weight.
The learning rate defines how far along the negative gradient we jump in the current step of the optimization.
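To make this concrete, here is a minimal Python sketch of a plain gradient descent loop. The toy quadratic loss and the learning rate value are my own illustrative assumptions, not something from the original thread:

```python
# Plain gradient descent sketch (the toy loss and learning rate are illustrative assumptions)

def loss(w):
    # Toy loss: a parabola with its minimum at w = 3
    return (w - 3) ** 2

def grad(w):
    # Derivative of the toy loss with respect to w
    return 2 * (w - 3)

w = 0.0              # initial weight
learning_rate = 0.1  # how big a step we take against the gradient

for step in range(50):
    w = w - learning_rate * grad(w)  # move against the gradient to decrease the loss

print(w)  # ends up very close to the minimum at w = 3
```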
Momentum ⚽️
Now we add the momentum. It is defined as the weight update in the previous step times a decay factor.
The decay factor is just a number between 0 and 1 defining how much of the previous update will be taken into account. α = 0 means no momentum and α = 1 is a lot.
A useful analogy is a ball rolling down a hill. If the hill is steep, the ball will accelerate (we update the weights more).
This will help the ball jump over small local minima and continue down the hill (to a smaller loss).
More momentum means a heavier ball with higher inertia.
Putting it all together 👇
So, in the original formula we update the weights using two terms.
The *gradient descent* term pushes us down the slope of the loss function.
The *momentum* term helps us accelerate and jump over small local minima.
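As a rough sketch of how the two terms combine in code (the toy loss, the value of α, and the variable names are my own assumptions for illustration):

```python
# Gradient descent with momentum: a minimal sketch
# (the toy loss, the value of alpha, and the variable names are illustrative assumptions)

def grad(w):
    # Gradient of the same toy loss (w - 3)^2
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.1
alpha = 0.9    # decay factor: how much of the previous update we keep
update = 0.0   # previous weight update (the "velocity" of the rolling ball)

for step in range(200):
    # momentum term + gradient descent term
    update = alpha * update - learning_rate * grad(w)
    w = w + update

print(w)  # oscillates a bit but settles very close to the minimum at w = 3
```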
Not that hard, right?
Every Friday I repost one of my old threads so more people get the chance to see them. During the rest of the week, I post new content on machine learning and web3.
If you are interested in seeing more, follow me @haltakov
For regression problems you can use one of several loss functions:
▪️ MSE
▪️ MAE
▪️ Huber loss
But which one is best? When should you prefer one instead of the other?
Thread 🧵
Let's first quickly recap what each of the loss functions does. After that, we can compare them and see the differences based on some examples.
👇
Mean Square Error (MSE)
For every sample, MSE takes the difference between the ground truth and the model's prediction and computes its square. Then, the average over all samples is computed.
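Here is a hedged sketch of how the three losses from the list above could be computed with NumPy. The sample values, including the big outlier, are purely illustrative:

```python
import numpy as np

# Sketches of the three regression losses
# (the sample values, including the big outlier, are purely illustrative)

def mse(y_true, y_pred):
    # Mean Square Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of absolute differences
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Huber loss: quadratic for small errors, linear for large ones
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(is_small, squared, linear))

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # note the outlier in the last sample
y_pred = np.array([1.1, 1.9, 3.2, 4.0])

print(mse(y_true, y_pred))    # dominated by the outlier
print(mae(y_true, y_pred))    # much less sensitive to the outlier
print(huber(y_true, y_pred))  # in between: quadratic near zero, linear for the outlier
```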
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.
The Binary Cross-Entropy Loss is a special case when we have only 2 classes.
👇
The most important part to understand is this one - this is the core of the whole formula!
Here, Y denotes the ground-truth label, while Ŷ is the predicted probability of the classifier.
Let's look at a simple example before we talk about the logarithm... 👇
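In that spirit, here is a small numeric sketch of the binary cross-entropy. The labels and predicted probabilities are illustrative values I picked, not the thread's own example:

```python
import numpy as np

# Binary cross-entropy sketch (the labels and predicted probabilities are illustrative values)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    # Y * log(Ŷ) is active for the positive class (Y = 1),
    # (1 - Y) * log(1 - Ŷ) is active for the negative class (Y = 0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
confident_and_right = np.array([0.95, 0.05, 0.90])  # confident, correct predictions
confident_but_wrong = np.array([0.05, 0.95, 0.10])  # confident, wrong predictions

print(binary_cross_entropy(y_true, confident_and_right))  # small loss (~0.07)
print(binary_cross_entropy(y_true, confident_but_wrong))  # large loss (~2.8)
```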
When machine learning met crypto art... they fell in love ❤️
The Decentralized Autonomous Artist (DAA) is a concept that is uniquely enabled by these technologies.
Meet my favorite DAA - Botto.
Let me tell you how it works 👇
Botto uses a popular technique to create images - VQGAN+CLIP
In simple terms, it uses a neural network that generates images (VQGAN), guided by the powerful CLIP model, which can relate images to text.
This method can create stunning visuals from a simple text prompt!
👇
Creating amazing images, though, requires finding the right text prompt.
Botto is programmed by its creator - artist Mario Klingemann (@quasimondo), but it creates all art itself. There is no human intervention in the creation of the images!
ROC curves measure the True Positive Rate (also known as Recall or Sensitivity). So, if you have an imbalanced dataset, the ROC curve will not tell you if your classifier completely ignores the underrepresented class.
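One way to see this in practice is to compare the ROC AUC with the average precision on a heavily imbalanced dataset. This is only a sketch: the dataset parameters and the choice of a logistic regression classifier are my own illustrative assumptions:

```python
# Sketch: ROC AUC vs. average precision on an imbalanced dataset
# (dataset parameters and the choice of classifier are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Roughly 1% positive class
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# On heavily imbalanced data the ROC AUC often looks optimistic,
# while the average precision better reflects how well the rare class is actually found
print("ROC AUC:          ", roc_auc_score(y_test, scores))
print("Average precision:", average_precision_score(y_test, scores))
```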