This is a Twitter series on #FoundationsOfML. Today, I want to talk about another fundamental question:

❓ What makes a metric useful for Machine Learning?

Let's take a look at some common evaluation metrics and their most important caveats... 👇🧵
Remember our purpose is to find some optimal program P for solving a task T, by maximizing a performance metric M using some experience E.

We've already discussed different modeling paradigms and different types of experiences.

👉 But arguably, the most difficult design decision in any ML process is which evaluation metric(s) to use.
There are many reasons why choosing the right metric is crucial.

🔑 If you cannot measure progress, you cannot objectively decide between different strategies.

This is true when solving any problem, but in ML the consequences are even bigger: 👇
🔶 In Machine Learning, the metric you choose directly determines the type of solution you end up with.

Remember that "solving" a problem in ML is actually about *searching*, among different models, for the one that maximizes a given metric.
📝 Hence, metrics in Machine Learning are not auxiliary tools to evaluate a solution. They are what *guides* the actual process of finding a solution.

Let's talk about some common metrics, focusing on classification for simplicity (for now): 👇
1️⃣ Accuracy measures the percentage of objects whose category you guessed right.

It's probably the most commonly used metric in the most common type of ML problem.
👍 The main advantage of accuracy is that it is very easy to interpret in terms of the business domain, and it is often aligned with what you actually want to achieve in classification: be right as many times as possible.
👎 One caveat is that accuracy is not differentiable, so it cannot be used directly as the objective in gradient-based optimization, which is how neural networks are trained. The usual workaround is to train on a differentiable surrogate, such as cross-entropy, and report accuracy afterwards.
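To see why a surrogate helps, here is a minimal sketch in plain Python. The function names and the numbers are made up for illustration. It compares accuracy with binary cross-entropy on two sets of predicted probabilities that lead to exactly the same decisions.

```python
import math

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)

def log_loss(labels, probs):
    """Mean binary cross-entropy: a smooth stand-in for the error rate."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 1, 0]
before = [0.60, 0.40, 0.55, 0.45]  # barely on the right side of 0.5
after = [0.90, 0.10, 0.95, 0.05]   # much more confident, same decisions

# Accuracy cannot tell the two sets of predictions apart...
print(accuracy(labels, before), accuracy(labels, after))
# ...but the surrogate loss drops, so a gradient has something to follow.
print(log_loss(labels, before), log_loss(labels, after))
```

Accuracy only changes when a prediction crosses the threshold, so it is flat almost everywhere. The cross-entropy changes with every small move in the probabilities.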
👎 Arguably the biggest problem of accuracy is that it counts every error as equal.

This is often not the case. It can be far worse to tell a sick person to go home than to tell a healthy person to take the treatment (depending on the treatment, of course).
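A minimal sketch of that idea, assuming a hypothetical cost table where missing a sick patient is ten times worse than a false alarm (the 10:1 ratio is invented for illustration, not a clinical figure):

```python
# Hypothetical costs, keyed by (true label, predicted label).
COST = {
    ("sick", "healthy"): 10.0,  # sent a sick person home
    ("healthy", "sick"): 1.0,   # treated a healthy person
}

def accuracy(labels, predictions):
    """Fraction of predictions that match the true label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def total_cost(labels, predictions):
    """Sum of the cost of every mistake; correct predictions cost nothing."""
    return sum(COST.get((y, p), 0.0) for y, p in zip(labels, predictions))

labels  = ["sick", "sick", "healthy", "healthy", "healthy"]
model_a = ["healthy", "sick", "healthy", "healthy", "healthy"]  # one missed patient
model_b = ["sick", "sick", "sick", "healthy", "healthy"]        # one false alarm

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(labels, preds), total_cost(labels, preds))
```

Both models make exactly one mistake, so accuracy scores them the same. The cost-weighted view does not.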
👎 Also, if the categories are very different in size, you can be making a very large mistake (in relative terms) on the less populated category while accuracy barely moves.

You can get >99% accuracy if you just tell everyone you find on the street that they don't have COVID.
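Here is that scenario as a small sketch. The population size and infection rate are assumptions picked only to make the arithmetic easy.

```python
# Hypothetical population: 1,000 people, 5 of them infected (label 1).
labels = [1] * 5 + [0] * 995

# A "classifier" that tells everyone they are healthy.
predictions = [0] * len(labels)

correct = sum(y == p for y, p in zip(labels, predictions))
accuracy = correct / len(labels)

# Of the people who are actually infected, how many did we catch?
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
detected = caught / sum(labels)

print(f"accuracy: {accuracy:.3f}")
print(f"fraction of infected people detected: {detected:.3f}")
```

The accuracy comes out at 99.5%, yet not a single infected person is found.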
The problem with accuracy is that it lumps different types of errors into a single number.

👉 If you care about one specific class more than the rest, measure 2️⃣Precision and 3️⃣Recall instead, as they tell you more about the nature of the mistakes you're making.
In general, there are two types of errors we can make when deciding the category C of an element:

🔹 We can say one element belongs to C when it doesn't (type I).

🔹 We can fail to say an element belongs to C when it does (type II).

How about we measure both separately: 👇
2️⃣ Precision measures the percentage of times you're right when you say the category is C. It drops when you make type I errors.

3️⃣ Recall measures the percentage of elements of category C that you correctly identified. It drops when you make type II errors.
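A minimal sketch of both definitions, counting the two error types separately. The helper name and the toy labels are illustrative. Returning 0.0 when a denominator is empty is a convention assumed here, not the only reasonable choice.

```python
def precision_recall(labels, predictions, positive):
    """Precision and recall for one category, from two parallel lists."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if p == positive and y == positive)  # right about C
    fp = sum(1 for y, p in pairs if p == positive and y != positive)  # type I errors
    fn = sum(1 for y, p in pairs if p != positive and y == positive)  # type II errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels      = ["C", "C", "C", "C", "x", "x", "x", "x", "x", "x"]
predictions = ["C", "C", "x", "x", "C", "x", "x", "x", "x", "x"]

precision, recall = precision_recall(labels, predictions, positive="C")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this toy data we said "C" three times and were right twice, while we found only two of the four real C elements.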
By looking at these metrics separately, you can better identify what kind of error you're making.

📝 If you still want a kind of average that weights both, you can use the F-Measure, which allows you to prioritize precision vs recall to any desired degree.
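The usual form of the F-Measure is F_beta = (1 + beta²) · P · R / (beta² · P + R), where beta > 1 favors recall and beta < 1 favors precision. A small sketch, with made-up precision and recall values:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when the score is undefined
    return (1 + b2) * precision * recall / denominator

p, r = 0.9, 0.5  # hypothetical model: precise, but misses half of the cases

print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # leans towards precision
print(f"F1   = {f_beta(p, r):.3f}")            # balanced
print(f"F2   = {f_beta(p, r, beta=2.0):.3f}")  # leans towards recall
```

Because this model is stronger on precision than on recall, its score goes down as beta grows.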
Precision and Recall are also very intuitive to interpret, but they still don't tell us the whole story.

👉 When we have more than two categories, we can fail at any one of them by confusing it with any other. Here, again, precision and recall are too general.
✏️ A *confusion matrix* tells us how many times we confuse each category Ci with any other category Cj across a test set.

It looks something like this.
✏️ Every number on the main diagonal counts predictions we got right, and every other number counts predictions we got wrong.

Accuracy, precision, and recall are easy to compute from the confusion matrix (I'll leave you that as an exercise 😜).
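Spoiler alert for the exercise: a minimal sketch that builds a confusion matrix from toy data and reads the three metrics off it. The categories, the labels, and the choice of rows for true categories and columns for predicted ones are all assumptions made for the example.

```python
from collections import Counter

def confusion_matrix(labels, predictions, classes):
    """Rows are true categories, columns are predicted categories."""
    counts = Counter(zip(labels, predictions))
    return [[counts[(t, p)] for p in classes] for t in classes]

classes = ["cat", "dog", "fox"]
labels      = ["cat", "cat", "cat", "dog", "dog", "dog", "fox", "fox", "fox", "fox"]
predictions = ["cat", "cat", "dog", "dog", "dog", "fox", "fox", "fox", "cat", "cat"]

matrix = confusion_matrix(labels, predictions, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>4}", row)

k = len(classes)
total = sum(sum(row) for row in matrix)

# Accuracy: everything on the main diagonal, over everything.
accuracy = sum(matrix[i][i] for i in range(k)) / total

# Precision and recall for one category: diagonal cell over its column / its row.
i = classes.index("cat")
precision_cat = matrix[i][i] / sum(matrix[r][i] for r in range(k))
recall_cat = matrix[i][i] / sum(matrix[i])

print(accuracy, precision_cat, recall_cat)
```

The off-diagonal cells show what the single numbers hide: in this toy data, half of the foxes are mistaken for cats.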
👍 The matrix itself shows a larger picture. It can tell us, for example, where we should focus on gathering more data.

👎 However, confusion matrices don't give us a single number we can optimize for and are thus harder to interpret.
The story we've seen here is common all over Machine Learning.

🔸 We can have simple, high-level, interpretable metrics that hide away the nuance.

🔸 Or we can have low-level metrics that show a bigger picture, but require more effort to interpret.
There is a lot more to tell about metrics and evaluation in general, and we've just focused on a very small part of the problem.

Some of the issues that need to be kept in mind: 👇
🔥 The metric we would like to optimize might not be optimizable at all, either because it's hard to evaluate (e.g., it's expensive or requires a human evaluator) or because it's not compatible with our optimization process (e.g., it's not differentiable).
🔥 We can have multiple contradictory objectives, and no trivial way to combine them into a single metric.

🔥 And sometimes we can judge a solution intuitively, but we have no idea how to write a mathematical formulation for a metric that encodes that intuition.
🤜 It's hard to overstate how important this topic is.

Almost every alignment problem in AI can be traced back to a poorly defined metric. For example, maximizing engagement is arguably a large part of the reason why social media is as broken as it is.
☝️ There is no objective way to decide what's the best metric for a problem by looking at the data alone. We have to decide what we want to aim for, and that in turn will define the problem we are actually solving.

⏳ Next time, we'll talk about some common problem types.

• • •

Keep Current with Alejandro Piad Morffis