Confusion Matrix

When we make a prediction for a binary classification problem, there are only four possible outcomes.

Note

In the image above, the actual values are in descending order and the predicted values are in ascending order. This is the layout scikit-learn uses for the confusion matrix (covered later in the chapter). You may encounter different layouts in other visualizations, but only the order of the cells changes.

We call these outcomes True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). True/False indicates whether the prediction is correct, and Positive/Negative indicates the predicted class (1 or 0).
So there are two types of errors: False Positive and False Negative.
A False Positive prediction is also called a Type 1 Error.
A False Negative prediction is also called a Type 2 Error.
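
To make the four outcomes concrete, here is a minimal sketch that counts them by hand in plain Python. The labels in y_true and y_pred are toy values invented for illustration, not taken from any dataset in this chapter.

```python
# Toy labels, made up for illustration: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))

# True/False: was the prediction correct? Positive/Negative: which class was predicted?
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)  # Type 1 Error
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)  # Type 2 Error

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Every prediction falls into exactly one of the four groups, so the counts always add up to the number of predictions.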

Confusion Matrix

The first way to look at the model's performance is to organize the predictions into a confusion matrix like this:

You can build a confusion matrix in Python using the confusion_matrix() function from sklearn.metrics.

from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_true, y_pred)
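
As a small sketch of how to read the result, the snippet below calls confusion_matrix() on toy labels (invented for illustration) and unpacks the four cells. For binary labels 0 and 1, scikit-learn puts the actual classes in rows and the predicted classes in columns, so flattening the matrix yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

conf_matrix = confusion_matrix(y_true, y_pred)
print(conf_matrix)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
tn, fp, fn, tp = conf_matrix.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```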

For a clearer visualization, you can pass the matrix to the heatmap() function from seaborn (imported as sns).

sns.heatmap(conf_matrix);

Here is an example of computing the confusion matrix for Random Forest predictions on the Titanic dataset:


import pandas as pd 
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the data and assign the variables
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/titanic.csv')
X, y = df.drop('Survived', axis=1), df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build and train a Random Forest and predict target for a test set
random_forest = RandomForestClassifier().fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True);

We can also plot proportions instead of instance counts using the normalize parameter. With normalize='all', each cell is divided by the total number of predictions:

conf_matrix = confusion_matrix(y_true, y_pred, normalize='all')


import pandas as pd 
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the data and assign the variables
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/titanic.csv')
X, y = df.drop('Survived', axis=1), df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build and train a Random Forest and predict target for a test set
random_forest = RandomForestClassifier().fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, normalize='all')
sns.heatmap(conf_matrix, annot=True);
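
Besides 'all', the normalize parameter also accepts 'true' and 'pred'. The sketch below compares the three options on toy labels invented for illustration; the variable names are only for this example.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

# 'all': every cell is divided by the total number of predictions
by_total = confusion_matrix(y_true, y_pred, normalize='all')
# 'true': every row (actual class) sums to 1
by_actual = confusion_matrix(y_true, y_pred, normalize='true')
# 'pred': every column (predicted class) sums to 1
by_predicted = confusion_matrix(y_true, y_pred, normalize='pred')

print(by_total)
print(by_actual)
print(by_predicted)
```

Normalizing over rows shows what share of each actual class was predicted correctly, while normalizing over columns shows what share of each predicted class was actually correct.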


Section 5. Chapter 1


Classification with Python
