Metrics | Comparing Models
Classification with Python

Metrics

So far, we have used the accuracy metric to measure a model's performance. This chapter shows the disadvantages of that metric and introduces several other metrics that address them.
Let's first recall the four outcome counts from the previous chapter: true positives (TP), true negatives (TN), false positives (FP, the Type 1 Error), and false negatives (FN, the Type 2 Error).
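
As a quick refresher, the four counts can be tallied by hand. The sketch below assumes binary labels where 1 is the positive class; the helper name confusion_counts and the sample labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type 1 Error
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type 2 Error
    return tp, tn, fp, fn


# Hypothetical labels, chosen only for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```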

Accuracy

Accuracy is the proportion of correct predictions among all predictions: (TP + TN) / (TP + TN + FP + FN).

But accuracy has its disadvantages.
Suppose you are trying to predict whether a patient has a rare disease. The dataset contains 99.9% healthy patients and 0.1% patients with the disease. Then always predicting that the patient is healthy gives an accuracy of 0.999, although such a model is entirely useless because it never detects the disease.
Datasets like this are called imbalanced, and balanced accuracy helps to deal with them.
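
The rare-disease scenario is easy to reproduce. The sketch below assumes a made-up dataset of 1,000 patients, exactly one of whom has the disease; the accuracy helper is written by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Assumed data: 999 healthy patients (0) and 1 patient with the disease (1)
y_true = [0] * 999 + [1]

# A "model" that always predicts healthy
always_healthy = [0] * len(y_true)

acc = accuracy(y_true, always_healthy)
print(acc)  # 0.999
```

The score looks excellent even though the single sick patient is missed.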

Balanced accuracy

Balanced accuracy calculates the proportion of actual positives that are predicted correctly, TP / (TP + FN), and the proportion of actual negatives that are predicted correctly, TN / (TN + FP), separately and then averages the two. This means it gives equal importance to each class, regardless of its size.

In the rare disease example, an always-healthy model gets every healthy patient right (1.0) and every sick patient wrong (0.0), so its balanced accuracy is only 0.5, which exposes the model as useless.
Still, neither accuracy nor balanced accuracy distinguishes the Type 1 Error from the Type 2 Error. That's where precision and recall step in.
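
To make the balanced accuracy figure for the always-healthy model concrete, here is a hand-written sketch. It assumes a made-up dataset of 999 healthy patients and 1 sick patient, and it assumes both classes occur in the true labels (otherwise one of the divisions would fail).

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class hit rates for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    positive_rate = tp / (tp + fn)  # share of actual positives found
    negative_rate = tn / (tn + fp)  # share of actual negatives found
    return (positive_rate + negative_rate) / 2


# Assumed data: 999 healthy patients (0) and 1 patient with the disease (1)
y_true = [0] * 999 + [1]
always_healthy = [0] * len(y_true)

bal_acc = balanced_accuracy(y_true, always_healthy)
print(bal_acc)  # 0.5
```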

Precision

The precision metric indicates how many of the instances the model predicted as positive are actually positive.
It is the proportion of True Positive predictions out of all positive predictions: TP / (TP + FP).

Using the precision metric, we can understand how frequent the Type 1 Error is. High precision means that Type 1 Errors (false positives) are rare among the positive predictions, and low precision means that they are frequent.
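
A minimal hand-written sketch of precision follows. The labels are made up for illustration, and returning 0.0 when the model makes no positive predictions is a convention chosen here, not something the chapter prescribes.

```python
def precision(y_true, y_pred):
    """TP / (TP + FP) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type 1 Errors
    if tp + fp == 0:
        return 0.0  # assumed convention when nothing is predicted positive
    return tp / (tp + fp)


# Hypothetical labels: 4 positive predictions, 3 of them correct
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

prec = precision(y_true, y_pred)
print(prec)  # 0.75
```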

Recall

The recall metric shows what proportion of the actually positive instances the model predicted correctly: TP / (TP + FN).

The recall metric shows how frequent the Type 2 Error is. High recall means that Type 2 Errors (false negatives) are rare, and low recall means that they are frequent.
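
Recall can be sketched by hand in the same way. The labels below are made up for illustration, and returning 0.0 when there are no actual positives is a convention assumed here.

```python
def recall(y_true, y_pred):
    """TP / (TP + FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type 2 Errors
    if tp + fn == 0:
        return 0.0  # assumed convention when there are no actual positives
    return tp / (tp + fn)


# Hypothetical labels: 4 actual positives, 2 of them found by the model
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

rec = recall(y_true, y_pred)
print(rec)  # 0.5
```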

The problem with using precision or recall alone is that each can be gamed. A model that predicts the positive class (1) for every instance has perfect recall, but its precision is poor.
Likewise, a model that predicts one positive instance correctly and labels all other instances as negative has perfect precision, but its recall is awful.
So it is easy to build a model with perfect precision or perfect recall, but much harder to build one that does well on both. That is why it is important to consider precision and recall together. Luckily, there is a metric that does so.
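
Both degenerate models can be checked with a few lines of Python. The sketch below assumes a made-up dataset of 10 positive and 90 negative instances; the helper precision_recall is hypothetical and written only for this illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec


# Assumed data: 10 positives followed by 90 negatives
y_true = [1] * 10 + [0] * 90

# Model A: predicts positive for every instance
always_positive = [1] * 100
prec_a, rec_a = precision_recall(y_true, always_positive)
print(prec_a, rec_a)  # 0.1 1.0

# Model B: predicts a single (correct) positive, everything else negative
one_positive = [1] + [0] * 99
prec_b, rec_b = precision_recall(y_true, one_positive)
print(prec_b, rec_b)  # 1.0 0.1
```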

F1 Score

The F1 score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). The harmonic mean is preferable to the regular (arithmetic) mean here because it penalizes a low value of either precision or recall more strongly.

F1 combines both precision and recall in one metric. F1 will be good only if both precision and recall are relatively high.
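
The difference between the two means is easiest to see with numbers. The sketch below uses illustrative precision and recall values (1.0 and 0.1, as a model with a single correct positive prediction might have); returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative values: perfect precision, very low recall
p, r = 1.0, 0.1

arithmetic_mean = (p + r) / 2
harmonic_mean = f1(p, r)

print(round(arithmetic_mean, 3))  # 0.55
print(round(harmonic_mean, 3))    # 0.182

# When both values are reasonably high, F1 is high as well
balanced_f1 = f1(0.8, 0.8)
print(round(balanced_f1, 3))      # 0.8
```

The arithmetic mean would rate the lopsided model at 0.55, while F1 drops to about 0.18 because of the low recall.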

Choosing the metric comes down to what your task is. Accuracy (or balanced accuracy for imbalanced datasets) is intuitive and gives a good understanding of how the model performs overall. If you need to be more specific about the errors a model makes, precision reflects Type 1 Errors, while recall reflects Type 2 Errors. The F1 score combines the two and is high only when both kinds of error are rare.

Metrics in Python

Scikit-learn implements all of these metrics. They can be found in the sklearn.metrics module as accuracy_score, balanced_accuracy_score, precision_score, recall_score, and f1_score:
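
As a minimal sketch of how these functions might be called, assuming scikit-learn is installed: each one takes the true labels first and the predicted labels second. The labels below are made up for illustration and are not from the course.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "balanced accuracy": balanced_accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

By default, precision_score, recall_score, and f1_score treat the label 1 as the positive class in a binary problem.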

Section 5. Chapter 2