I've decided to share my slide deck detailing my ... concerns about the F measure. tl;dr: just don't.
First, some definitions of precision, recall, and F. F is a harmonic mean of P and R. Harmonic means are appropriate for averaging ratios, which is van Rijsbergen's original motivation for recommending F. F is for sets, not ranked lists.
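For reference, the standard formulas (TP, FP, FN are the usual confusion-matrix counts); note that F_beta reduces to the plain harmonic mean of P and R when beta = 1:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2\,P + R}, \qquad
F_1 = \frac{2PR}{P + R} = \left(\frac{P^{-1} + R^{-1}}{2}\right)^{-1}
```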
Here is the tl;dr on one slide.
The first problem is that, while precision and recall are fixed for a given set, in real life they vary inversely.
This is a recall-precision graph from an information retrieval experiment. It shows that as you step down a ranked list, recall goes up, but precision either holds steady or (usually) goes down.
If your precision stays steady as recall increases, either your dataset is too small or your true positive rate is so high you don't need search.
Now, maybe this is unfair because F is for sets, not ranked lists. But here's the thing with sets as we think of them in AI/ML/IR/NLP: the sets output from classifiers are defined by optimizing a decision boundary. Move the threshold, change the set.
So, rather than choosing the boundary, why not let the threshold act as an operating point, and show what happens for all values of the threshold? Voila, we're back to a ranked list, which (I claim) is a good thing.
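A minimal sketch of that idea in Python (the scores and labels are made up purely for illustration): sweep the decision threshold down a classifier's scores and record (R, P) at each operating point, which is exactly the recall-precision curve you get from the ranked list.

```python
import numpy as np

# Hypothetical classifier scores and gold labels, for illustration only.
scores = np.array([0.95, 0.90, 0.82, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1,    1,    0,    1,    1,    0,    0,    1,    0,    0])

# Sort by descending score; each rank cutoff is one threshold / operating point.
order = np.argsort(-scores)
labels = labels[order]

n_relevant = labels.sum()
tp = np.cumsum(labels)                # true positives at each cutoff
k = np.arange(1, len(labels) + 1)     # number of items returned at each cutoff
precision = tp / k
recall = tp / n_relevant

for depth, (r, p) in enumerate(zip(recall, precision), start=1):
    print(f"depth {depth:2d}: R={r:.2f} P={p:.2f}")
```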
Anyway, if you have two ratios, one that goes down as the other goes up, you probably shouldn't average them. The beta parameter of F sets the rate of the tradeoff relationship.
Getting to the second problem here. If you have two F values for two systems, which is better?
To know which is better, you need to know what the precision and recall values were, because a change in F can be due to a change in precision, recall, or both. (Remember, usually it's "both".)
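A quick numeric illustration of why a bare F value underdetermines the system: these two hypothetical (P, R) operating points give identical F1 but describe very different systems.

```python
def f_beta(p, r, beta=1.0):
    """F measure: weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Two very different operating points, same F1 = 0.5:
print(f_beta(0.5, 0.5))    # balanced system           -> 0.5
print(f_beta(1.0, 1/3))    # perfect P, terrible R     -> 0.5
```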
So note that this table has a whole column you don't need. You can shorten your figures and make more room in your compressed conference paper by just reporting P and R, and dropping F. #winning
Last problem. F is an average of P and R. If we then go and average a bunch of F scores, we have no hope of understanding what the underlying system behavior was.
Here's an R boxplot with the F values jittered in the overlay so you can see them all. Those F values are all over. Now remember that each point is actually a P value and an R value. Which system is better?
(Forgot to say: system labels are kind of arbitrary in this talk, as I just grabbed some handy TREC runs to demonstrate the points. These issues hold for any F scores.)
Now let's go towards a better approach. What if we just plotted the individual P and R values instead of trying to look at a range of F values? I can easily see that one system here varies quite a bit more along the precision axis.
This is the money tweet. Compute the "isocurves" of F, the P and R values that give you the same F score, and overlay those isocurves on the P/R scatterplot. Points on the same curve have the same F. But not the same performance!
This view helps us see the relationship of F to P and R, which are the actual system performance characteristics. You could learn something with a plot like that. Try varying beta, for example.
(this is F with beta=1 btw)
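A sketch of that plot (nothing assumed beyond matplotlib): solving F_beta = (1+beta^2)PR / (beta^2 P + R) for R gives R = beta^2 F P / ((1+beta^2)P - F), so each fixed F traces a curve in P/R space that you can draw under your scatterplot. Change beta to see the tradeoff re-weight.

```python
import numpy as np
import matplotlib.pyplot as plt

def isocurve_recall(p, f, beta=1.0):
    """Recall needed to reach F score f at precision p (F_beta solved for R)."""
    b2 = beta * beta
    return b2 * f * p / ((1 + b2) * p - f)

beta = 1.0
p = np.linspace(0.01, 1.0, 500)
for f in (0.2, 0.4, 0.6, 0.8):
    r = isocurve_recall(p, f, beta)
    valid = (r > 0) & (r <= 1.0)   # keep only feasible (P, R) combinations
    plt.plot(p[valid], r[valid], label=f"F{beta:g} = {f}")

# Overlay your per-run (P, R) points here, e.g.:
# plt.scatter(precisions, recalls, marker='.')
plt.xlabel("Precision"); plt.ylabel("Recall")
plt.legend(); plt.show()
```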
Really, this is a talk about the danger of single measures that summarize complex performance characteristics. You can almost always break the summary number down into something much more useful.
One of my research questions is how ML systems can improve F by optimizing a loss function that is dominated by precision. There is a disconnect between the loss metric and the evaluation metric, so SGD isn't walking where you want it to.
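For concreteness, one known workaround for that disconnect (not necessarily what I'd endorse, and not specific to my work) is a "soft" F1 surrogate: compute TP/FP/FN from predicted probabilities instead of thresholded decisions, so the counts, and hence F1, admit gradients. A minimal PyTorch-style sketch, assuming y_true in {0,1} and y_prob in [0,1]:

```python
import torch

def soft_f1_loss(y_prob, y_true, eps=1e-8):
    """Differentiable F1 surrogate: probabilities stand in for hard 0/1
    decisions, so the confusion-matrix counts have gradients."""
    tp = (y_prob * y_true).sum()
    fp = (y_prob * (1 - y_true)).sum()
    fn = ((1 - y_prob) * y_true).sum()
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)   # F1 = 2TP / (2TP + FP + FN)
    return 1 - soft_f1                             # minimize loss = maximize soft F1

# Illustration with made-up tensors:
y_true = torch.tensor([1., 0., 1., 1., 0.])
y_prob = torch.tensor([0.9, 0.2, 0.6, 0.4, 0.1], requires_grad=True)
loss = soft_f1_loss(y_prob, y_true)
loss.backward()   # gradients now push toward higher soft F1
```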