Machine Learning Formulas Explained! 👨‍🏫

This is the formula for the Binary Cross Entropy Loss. It is commonly used for binary classification problems.
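Written out (the original tweet showed it as an image), the standard form is:

BCE = -(1/N) · Σᵢ [ Yᵢ · log(Ŷᵢ) + (1 - Yᵢ) · log(1 - Ŷᵢ) ]

where N is the number of samples, Yᵢ is the ground-truth label of sample i, and Ŷᵢ is the predicted probability.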

It may look super confusing, but I promise you that it is actually quite simple!

Let's go step by step 👇

#RepostFriday
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.

The Binary Cross-Entropy Loss is a special case when we have only 2 classes.
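To see why, note that the general Cross-Entropy Loss for a single sample sums over all classes:

CE = -Σ_c Y_c · log(Ŷ_c)

With only two classes, the label of the second class is 1 - Y and its predicted probability is 1 - Ŷ, so the sum collapses to exactly the two terms in the formula above.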

👇
The most important part to understand is this one - this is the core of the whole formula!

Here, Y denotes the ground-truth label, while Ŷ is the predicted probability of the classifier.
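In the written-out formula, this core is the summand Yᵢ · log(Ŷᵢ) + (1 - Yᵢ) · log(1 - Ŷᵢ) for a single sample i.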

Let's look at a simple example before we talk about the logarithm... 👇
Imagine we have a bunch of photos and we want to classify each one as being a photo of a bird or not.

All photos are manually labeled so that Y=1 for all bird photos and Y=0 for the rest.

The classifier (say, a neural network) outputs a probability of the photo containing a bird, like Ŷ=0.9

👇
Now, let's look at the logarithm.

Since Ŷ is a number between 0 and 1, log Ŷ will be a negative number that increases towards 0 as Ŷ approaches 1.

Let's take an example of a bird photo (Y=1):
▪️ Classifier predicts 99% bird, so we get -0.01
▪️ Classifier predicts 5% bird, so we get -3
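You can verify these values in Python (using the natural logarithm):

```python
import math

# log of the predicted probability for the true class (Y=1)
print(math.log(0.99))  # ≈ -0.01 -> confident and correct
print(math.log(0.05))  # ≈ -3.0  -> confident and wrong
```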

That's weird 👇
For a loss, we want a value close to 0 if the classifier is right and a large value when the classifier is wrong. In the example above it was the opposite!

Fortunately, this is easy to fix - we just need to multiply the value by -1, and then we can interpret it as an error 🤷‍♂️

👇
If the photo is labeled as not being a bird, then we have Y=0 and so the whole Y log(Ŷ) term becomes 0.

That's why we have the second part - the negative case. Here we just take 1-Y and 1-Ŷ for the probabilities. We are interested in the probability of the photo not being a bird.
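A quick sanity check with made-up numbers:

```python
import math

# non-bird photo (Y=0), classifier predicts 10% bird (Y_hat = 0.1)
y, y_hat = 0, 0.1
# the probability of "not a bird" is 1 - Y_hat = 0.9
print(-(1 - y) * math.log(1 - y_hat))  # ≈ 0.105 -> small loss, mostly correct
```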

👇
Combining both, we get the error for one data sample (one photo). Note that one of the two terms will always be 0, depending on how the photo is labeled.
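Here is a minimal sketch of this per-sample loss in Python (bce_sample is just an illustrative helper, not a library function):

```python
import math

def bce_sample(y, y_hat):
    """Binary cross-entropy for one sample.

    y     : ground-truth label (0 or 1)
    y_hat : predicted probability of the positive class
    """
    # exactly one of the two terms is non-zero, depending on the label y
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce_sample(1, 0.99))  # ≈ 0.01 -> correct and confident, low loss
print(bce_sample(1, 0.05))  # ≈ 3.0  -> wrong and confident, high loss
```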

This is actually the case when we have more than 2 classes as well, as long as we use one-hot encoding - only the term for the true class is non-zero!

OK, almost done with that part 👇
By now, you should have a feeling for how the core of the formula works - but why do we use a logarithm?

I won't go into detail, but let's just say this is a common way to formulate optimization problems in math - the logarithm turns multiplications into sums.
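The key identity is log(a · b) = log(a) + log(b). Maximizing a likelihood (a product of per-sample probabilities) then becomes maximizing a sum of log-probabilities, which is numerically more stable and easier to differentiate.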

Now the rest 👇
We know how to compute the loss for one sample, so now we just take the mean over all samples in our dataset (or minibatch) to compute the loss.

Remember - we need to multiply everything by -1 to flip the sign, so that we can interpret the value as a loss (low = good, high = bad).

👇
Where to find it in your ML framework?

The Cross-Entropy Loss is sometimes also called Log Loss or Negative Log-Likelihood (NLL) Loss.

▪️ PyTorch - torch.nn.BCELoss and torch.nn.CrossEntropyLoss (torch.nn.NLLLoss if your model already outputs log-probabilities)
▪️ TensorFlow - tf.keras.losses.BinaryCrossentropy and CategoricalCrossentropy
▪️ Scikit-learn - sklearn.metrics.log_loss
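For example, with scikit-learn (a quick usage sketch with made-up numbers):

```python
from sklearn.metrics import log_loss

y_true = [1, 1, 0, 0]            # ground-truth labels
y_pred = [0.9, 0.8, 0.1, 0.2]    # predicted probabilities of class 1
print(log_loss(y_true, y_pred))  # ≈ 0.164
```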
And if it is easier for you to read code than formulas, here is a simple implementation and two examples of a good (low loss) and a bad classifier (high loss).
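The original code was attached as an image; here is a minimal NumPy sketch of such an implementation (my own re-creation of the idea, not the exact original code):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy over a dataset (or minibatch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

labels = [1, 1, 0, 0]

# good classifier: confident and mostly correct -> low loss
print(binary_cross_entropy(labels, [0.9, 0.8, 0.1, 0.2]))  # ≈ 0.16

# bad classifier: confident and mostly wrong -> high loss
print(binary_cross_entropy(labels, [0.1, 0.2, 0.9, 0.8]))  # ≈ 1.96
```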
Every Friday I repost one of my old threads so more people get the chance to see them. During the rest of the week, I post new content on machine learning and web3.

If you are interested in seeing more, follow me @haltakov
