Train-Test Split. Cross-Validation

In the previous chapters, we built models and predicted new values. But so far we have no idea how well a model performs or whether its predictions are trustworthy.

Train-test split

To measure the model's performance, we need a subset of labeled data that the model has not seen. So we randomly split all the labeled data into a training set and a test set.

This is achievable using the train_test_split() function of sklearn.

Usually, you split the data so that around 70-90% goes to the training set and 10-30% to the test set. However, tens of thousands of test instances are more than enough, so there is no need to use even 10% if your dataset is large (millions of instances).
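
As a minimal sketch of how the split size can be controlled, the snippet below uses a small made-up dataset (the arrays are placeholders, not the course data). It assumes the test_size parameter of train_test_split(), which accepts either a fraction of the data or an absolute number of test instances; random_state is set only to make the illustration repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 100 instances with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# A float test_size is read as a fraction of the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 20

# An integer test_size is read as an absolute number of test instances,
# which is convenient when the dataset is very large
X_train_abs, X_test_abs, y_train_abs, y_test_abs = train_test_split(
    X, y, test_size=30, random_state=42
)
print(len(X_train_abs), len(X_test_abs))  # 70 30
```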
Now we can train the model using the training set and calculate its accuracy on the test set.


from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
X = df[['StarWars4_rate', 'StarWars5_rate']] # Store feature columns as `X`
y = df['StarWars6'] # Store target column as `y`
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Note that we use only transform for `X_test`
# Initialize and train a model
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
# Print the accuracy on the test set
print(knn.score(X_test_scaled, y_test))

But this approach has some flaws:

We do not use all the available data for training, although more training data could improve our model;
Since we evaluate the model's accuracy on a small portion of the data (the test set), this accuracy score can be unreliable on smaller datasets (you can run the code above multiple times and see how the accuracy changes each time a new test set is sampled).
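
The second flaw can be illustrated with a short sketch that repeats the split with different random seeds. The data here is synthetic, generated with make_classification() as a stand-in for a small labeled dataset, and holdout_accuracy is a hypothetical helper written for this illustration; the exact numbers will differ from those on the course dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small labeled dataset (assumption, not the course data)
X, y = make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

def holdout_accuracy(seed):
    """Split, scale, train, and return the test accuracy for one random split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # only transform for the test set
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
    return knn.score(X_test_scaled, y_test)

# Ten different splits of the same data give ten accuracy estimates
accuracies = [holdout_accuracy(seed) for seed in range(10)]
print('Accuracies:', accuracies)
print('Spread (max - min):', max(accuracies) - min(accuracies))
```

With only 20 test instances per split, each misclassified instance shifts the score by 0.05, which is why the estimate from a single split can be noisy.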

Cross-validation

Cross-validation is designed to address those problems. The idea is to shuffle the whole set, split it into 5 equal parts (folds), and run 5 iterations, each using 4 parts for training and the remaining 1 as a test set.

So we train five models on slightly different datasets. In each iteration, we calculate the test set accuracy. Once we've done that, we take the average of those 5 accuracy scores, which is our cross-validation accuracy score. It is more reliable because every instance is used for testing exactly once, so the accuracy is calculated on all our data, just over five iterations.
Now we know how well the model performs and can re-train the model using the whole dataset.
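
The procedure described above can be sketched by hand as follows. This is an illustration under assumptions rather than a reference implementation: the data is synthetic (make_classification()), the folds come from sklearn's KFold with shuffling enabled, and the scaler is fitted on the training folds only in each iteration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labeled data (assumption, not the course data)
X, y = make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Shuffle the data and split it into 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
tested_indices = []

for train_idx, test_idx in kfold.split(X):
    # 4 folds for training, 1 fold for testing
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X[train_idx])
    X_test_scaled = scaler.transform(X[test_idx])
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y[train_idx])
    fold_scores.append(knn.score(X_test_scaled, y[test_idx]))
    tested_indices.extend(int(i) for i in test_idx)

# The cross-validation score is the average of the 5 fold accuracies
cv_accuracy = np.mean(fold_scores)
print('Fold scores:', fold_scores)
print('Cross-validation accuracy:', cv_accuracy)

# Once the evaluation is done, re-train on the whole dataset
final_scaler = StandardScaler()
final_model = KNeighborsClassifier(n_neighbors=3).fit(final_scaler.fit_transform(X), y)
```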

Note

You can use a number of folds other than five, say some number n. Then you will use one fold for the test set and n-1 folds for the training set. The cross_val_score() function of sklearn makes this easy to configure through its cv parameter.

Here is an example of usage:


from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
X = df[['StarWars4_rate', 'StarWars5_rate']] # Store feature columns as `X`
y = df['StarWars6'] # Store target column as `y`
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Initialize a model
knn = KNeighborsClassifier(n_neighbors=3)
# Calculate the accuracy on each of the 5 folds
scores = cross_val_score(knn, X_scaled, y, cv=5)
print('Scores: ', scores)
print('Average score:', scores.mean())
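
In the example above, the scaler is fitted on the whole dataset before the folds are created, so each test fold influences the scaling. One common way to keep the "fit on training data only" rule from the train-test split example is to wrap the scaler and the model in a pipeline, so that cross_val_score() re-fits both inside every fold. The sketch below assumes synthetic data from make_classification() instead of the course dataset and also shows the cv parameter with two different numbers of folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labeled data (assumption, not the course data)
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# The pipeline fits the scaler on the training folds only, then applies it to the test fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

results = {}
for n_folds in (5, 10):
    scores = cross_val_score(model, X, y, cv=n_folds)
    results[n_folds] = scores
    print(f'{n_folds} folds, average score: {scores.mean():.3f}')
```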

The score used by default for classification is accuracy, so an average score of around 0.75 means that only around 75% of predictions are correct. But maybe with a different n_neighbors, the accuracy will be better? It will! The following chapter covers choosing the n_neighbors (or k) with the highest cross-validation accuracy.
