Machine Learning Formulas Explained! πŸ‘¨β€πŸ«

This is the formula for the Binary Cross-Entropy Loss:

L = -1/N Ξ£ [Y log(ΕΆ) + (1 - Y) log(1 - ΕΆ)]

(the sum runs over all N samples in the dataset). This loss function is commonly used for binary classification problems.

It may look super confusing, but I promise you that it is actually quite simple!

Let's go step by step πŸ‘‡
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.

The Binary Cross-Entropy Loss is a special case when we have only 2 classes.

πŸ‘‡
The most important part to understand is the term inside the sum - Y log(ΕΆ) + (1 - Y) log(1 - ΕΆ) - this is the core of the whole formula!

Here, Y denotes the ground-truth label, while ΕΆ is the probability predicted by the classifier.

Let's look at a simple example before we talk about the logarithm... πŸ‘‡
Imagine we have a bunch of photos and we want to classify each one as being a photo of a bird or not.

All photos are manually labeled so that Y=1 for all bird photos and Y=0 for the rest.

The classifier (say a NN) outputs a probability of the photo containing a bird, like ΕΆ=0.9

πŸ‘‡
Now, let's look at the logarithm.

Since ΕΆ is a number between 0 and 1, log ΕΆ will be a negative number that gets closer to 0 as ΕΆ approaches 1.

Let's take an example of a bird photo (Y=1):
β–ͺ️ Classifier predicts 99% bird, so we get log(0.99) β‰ˆ -0.01
β–ͺ️ Classifier predicts 5% bird, so we get log(0.05) β‰ˆ -3
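
You can check these values yourself - the formula uses the natural logarithm (a quick sketch):

```python
import math

# Natural log of the predicted bird probability for a bird photo (Y=1)
print(math.log(0.99))  # ~ -0.01 -> confident and correct
print(math.log(0.05))  # ~ -3.0  -> confident but wrong
```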

That's weird πŸ‘‡
For a loss, we want a value close to 0 if the classifier is right and a large value when the classifier is wrong. In the example above it was the opposite!

Fortunately, this is easy to fix - we just multiply the value by -1 and can then interpret it as an error πŸ€·β€β™‚οΈ

πŸ‘‡
If the photo is labeled as not being a bird, then we have Y=0 and the whole first term Y log(ΕΆ) becomes 0.

That's why we have the second part - the negative case. Here we just take 1-Y and 1-ΕΆ instead, because now we are interested in the probability of the photo not being a bird.

πŸ‘‡
Combining both we get the error for one data sample (one photo). Note that one of the terms will always be 0, depending on how the photo is labeled.

This is actually the case when we have more than 2 classes as well, as long as we use one-hot encoding!
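
If reading code is easier for you, the loss for a single sample could look like this (a minimal sketch; bce_single is my own name for it):

```python
import math

def bce_single(y, y_hat):
    """Loss for one sample: y is the label (0 or 1), y_hat the predicted probability."""
    # Exactly one of the two terms is non-zero, because y is either 0 or 1
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce_single(1, 0.99))  # bird photo, confident and correct -> ~0.01
print(bce_single(0, 0.99))  # not a bird, but classifier says bird -> ~4.6
```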

OK, almost done with that part πŸ‘‡
Now, you should have a feeling of how the core of the formula works, but why do we use a logarithm?

I won't go into detail, but this comes from maximum likelihood estimation - a common way to formulate optimization problems in math. The logarithm turns products of probabilities into sums, which are much easier to handle.
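
A quick sketch of the product-to-sum trick:

```python
import math

p = [0.9, 0.8, 0.95]  # probabilities the model assigns to the correct labels

# Multiplying many probabilities quickly underflows; summing logs does not
print(math.log(math.prod(p)))       # log of the product ...
print(sum(math.log(x) for x in p))  # ... equals the sum of the logs
```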

Now the rest πŸ‘‡
We know how to compute the loss for one sample, so now we just take the mean over all samples in our dataset (or minibatch) to compute the loss.

Remember - we multiply everything by -1 to flip the sign, so we can interpret the value as a loss (low is good, high is bad).

πŸ‘‡
Where to find it in your ML framework?

The Cross-Entropy Loss is sometimes also called Log Loss or Negative Log-Likelihood (NLL).

β–ͺ️ PyTorch - torch.nn.BCELoss (and torch.nn.NLLLoss for the multi-class case)
β–ͺ️ TensorFlow - tf.keras.losses.BinaryCrossentropy and CategoricalCrossentropy
β–ͺ️ Scikit-learn - sklearn.metrics.log_loss
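
For example, with scikit-learn (a quick sketch):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1]
y_pred = [0.95, 0.05, 0.90]  # predicted probabilities of the positive class

print(log_loss(y_true, y_pred))  # ~0.07
```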
And if it is easier for you to read code than formulas, here is a simple implementation and two examples of a good (low loss) and a bad classifier (high loss).
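
A minimal version could look like this (my own sketch):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean BCE: y_true holds labels (0 or 1), y_pred predicted probabilities."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / len(y_true)

# Good classifier - confident and correct -> low loss
print(binary_cross_entropy([1, 0, 1], [0.95, 0.05, 0.90]))  # ~0.07

# Bad classifier - confident but wrong -> high loss
print(binary_cross_entropy([1, 0, 1], [0.10, 0.90, 0.20]))  # ~2.07
```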
I regularly post threads like this on topics like machine learning and self-driving cars.

Follow me @haltakov for more!
Yes, this is a good point! The input to the loss needs to be probabilities!

For classifiers that don't necessarily output probabilities (for example a NN with ReLU activations), you usually add a softmax (or sigmoid) layer.

Or use torch.nn.CrossEntropyLoss in PyTorch, which applies the softmax internally (torch.nn.BCEWithLogitsLoss is the binary counterpart).
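
For example, for the binary case (a small sketch):

```python
import torch

# Raw, unnormalized scores (logits) from a network - not probabilities yet
logits = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss applies the sigmoid internally (numerically stable)
loss = torch.nn.BCEWithLogitsLoss()(logits, labels)
print(loss.item())
```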

This is actually interesting! There are two very different ways to arrive at the same formula. One is the log loss / maximum likelihood view you mention; the other comes from information theory. They happen to coincide in this context.

machinelearningmastery.com/cross-entropy-…
