When a model predicts a label with a given score, what does that score represent? We'll try to answer that in today's #IDATHAINSIGHTS thread 🧵- 1/22
2/22 - Usually, it is used and documented as a measure of how confident the model is about its prediction. Sometimes it's also interpreted as a probability. But wait, a probability of what?

#IDATHAINSIGHTS
3/22 - Strictly speaking, in order for it to be a probability, it just needs to follow these two rules: 👇
- it has to be a value between 0 and 1
- the values over all possible classes must sum to 1
en.m.wikipedia.org/wiki/Probabili…

#IDATHAINSIGHTS
4/22 - Usually, the output layer of a #NeuralNetwork (NN) is a softmax layer, which guarantees both rules are satisfied.
en.wikipedia.org/wiki/Softmax_f…

#IDATHAINSIGHTS
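To make that concrete, here's a minimal softmax sketch in plain NumPy (illustrative code, not from the thread, and not tied to any framework), showing that both rules hold:

```python
# Minimal softmax sketch in plain NumPy (illustrative, framework-free).
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)                                 # e.g. [0.659 0.242 0.099]
print(((0 <= probs) & (probs <= 1)).all())   # rule 1: True
print(np.isclose(probs.sum(), 1.0))          # rule 2: True
```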
5/22 - But is this enough to give a value like 0.98 a probabilistic interpretation? 🤔

#IDATHAINSIGHTS
6/22 - Well, it turns out that if you just draw random scores and apply a #softmax layer to them, you end up with confidence scores that are between 0 and 1 and add up to one: a probability measure, but a useless one. We'll need more than this.

#IDATHAINSIGHTS
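A quick sketch of that point (hypothetical numbers, same idea as the softmax above): pure noise already satisfies both rules.

```python
# Random "logits" pushed through softmax: a perfectly valid probability
# vector that carries no information about any class.
import numpy as np

rng = np.random.default_rng(0)
random_logits = rng.normal(size=5)            # no model, no data, just noise
fake_probs = np.exp(random_logits) / np.exp(random_logits).sum()

print(fake_probs)                             # all between 0 and 1
print(np.isclose(fake_probs.sum(), 1.0))      # sums to 1, yet means nothing
```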
7/22 - Of course a #NeuralNetwork is much more than a random number generator: all its layers are trained to minimize an error function, which is usually the cross entropy.

#IDATHAINSIGHTS
8/22 - When a model is trained to minimize cross entropy, and uses a softmax layer as an output layer, it can be shown that the cross entropy is minimized when the output of the system coincides with the posterior of the classes given the samples.

#IDATHAINSIGHTS
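A compact way to state that result (the standard cross-entropy/KL decomposition, not spelled out in the thread), writing p(y|x) for the true posterior and q(y|x) for the model output:

```latex
\underbrace{-\,\mathbb{E}_{x,y}\big[\log q(y \mid x)\big]}_{\text{cross entropy}}
  \;=\; \underbrace{\mathbb{E}_{x}\big[H\big(p(\cdot \mid x)\big)\big]}_{\text{irreducible}}
  \;+\; \underbrace{\mathbb{E}_{x}\big[\mathrm{KL}\big(p(\cdot \mid x)\,\|\,q(\cdot \mid x)\big)\big]}_{\ge\, 0}
```

Since the KL term is zero exactly when q(·|x) = p(·|x), the cross entropy reaches its minimum when the model output equals the posterior.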
9/22 - This is the reason why we think the output of the system should be posteriors: the probability of each class being the correct one, given the sample. 👌

#IDATHAINSIGHTS
10/22 - So, in practice, what do we expect from a 0.98 score? If this is actually a posterior, then 98% of the predictions made with 0.98 confidence should be correct, and the other 2% should not.

#IDATHAINSIGHTS
11/22 - When this happens, we say that the model is calibrated: the scores we interpret as probabilities are actually verified in practice. However, it is also known that modern NNs are, in practice, really badly calibrated.
arxiv.org/pdf/1706.04599…

#IDATHAINSIGHTS
12/22 - So, how do we check whether a model is calibrated? ☑️

#IDATHAINSIGHTS
13/22 - All we need to do is build a histogram of the confidence scores. Suppose it has 10 bins: for a calibrated model, the samples in each bin should have an accuracy that matches the confidence level the bin represents.

#IDATHAINSIGHTS
14/22 - Otherwise, the model is miscalibrated and you should not interpret its scores as probabilities. Diagrams of this kind are called "reliability diagrams" (see the sketch below).

#IDATHAINSIGHTS
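Here's a sketch of that binning, assuming `confidences` holds the winning-class score for each sample and `correct` is a boolean array marking whether the prediction matched the label (both names are hypothetical):

```python
# 10-bin reliability check: for a calibrated model, accuracy should be
# approximately equal to the mean confidence inside every bin.
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            print(f"({lo:.1f}, {hi:.1f}]  "
                  f"conf={confidences[mask].mean():.3f}  "
                  f"acc={correct[mask].mean():.3f}  n={mask.sum()}")
```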
15/22 - There are many ways to measure this miscalibration; one of the most common is the Expected Calibration Error (ECE), sketched below. If you want to read more about the pros & cons of different metrics, check out this paper 📃
openaccess.thecvf.com/content_CVPRW_…
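Since the picture with the definition isn't included in the unroll, here is the usual formulation (as in Guo et al., 2017): ECE = Σ_m (|B_m|/n) · |acc(B_m) − conf(B_m)|, i.e. the bin-size-weighted average gap between accuracy and mean confidence. A sketch, reusing the hypothetical `confidences`/`correct` arrays from above:

```python
# Expected Calibration Error: weighted average of |accuracy - confidence|
# over the bins of the reliability histogram.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
    return ece   # 0.0 for a perfectly calibrated model
```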
16/22 - But if there are theoretical guarantees that cross entropy + softmax outputs posteriors, why are modern NNs so badly calibrated?

#IDATHAINSIGHTS
17/22 - There are several things that contribute to that: batch normalization and the depth & width of the NN, for instance. They generally lead to more accurate models, but at the expense of miscalibration.

#IDATHAINSIGHTS
18/22 - In addition to this, the distribution of the training data is usually modified to improve the training process.

#IDATHAINSIGHTS
19/22 - It's also important to know that even if our models are really poorly calibrated, they are still capable of predicting the right class. That is because calibration and discrimination do not depend on each other. 🔷

#IDATHAINSIGHTS
20/22 -
- Discrimination: how well the scores separate the classes.
- Calibration: whether those scores can be interpreted probabilistically.

#IDATHAINSIGHTS
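A toy sketch of that independence (hypothetical logit values): dividing the logits by a temperature never changes which class wins, so discrimination is untouched, but it drastically changes the confidence values, and therefore the calibration.

```python
# Same argmax, very different confidences: discrimination is preserved
# while calibration changes.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.3, 2.5, 1.0]])

for T in (1.0, 0.1):                       # T < 1 sharpens the scores
    probs = softmax(logits / T)
    print(f"T={T}: classes={probs.argmax(axis=1)}, "
          f"confidences={probs.max(axis=1).round(3)}")
```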
21/22 - Should you worry about miscalibration, then? Well, it depends on your use case. If you just want to optimize a classification metric (accuracy, f-score, etc.), you may not care, but if you need your scores to be posteriors, you should fix it. How to do it? Stay tuned 😜
22/22 - Interested in this topic? Check out Luciana Ferrer's great talk at @Khipu_AI 🙌
tv.vera.com.uy/video/55289

References: 📃
drive.google.com/file/d/1j7lykM…
scikit-learn.org/stable/modules…
cs231n.stanford.edu