Evaluating a Model. Train-Test split. | Modeling
ML Introduction with scikit-learn
Evaluating a Model. Train-Test split.

When we build a model for predictions, it is essential to understand how well it performs before actually using it to predict anything.
Evaluating a model is the process of assessing the quality of its predictions; this is what the .score() method is for.
However, evaluating the model on the training set gives unreliable results, since a model is likely to perform better on the data it was trained on than on data it has never seen.
So it is crucial to evaluate the model on unseen data to understand how well it will generalize.
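To see why the training-set score is misleading, here is a minimal sketch on a synthetic dataset (made up for illustration, not the course data): a 1-nearest-neighbor classifier memorizes its training points, so its score on the training set is a perfect 1.0 no matter how well it generalizes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 200 random 2-D points labeled by which side of a line they fall on
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1-NN memorizes the training set: each point's nearest neighbor is itself
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print('Score on training data:', knn1.score(X, y))  # 1.0
```

A perfect training score like this tells us nothing about performance on new data, which is exactly why we hold out a test set.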

We can do this by randomly splitting the data into a training set and a test set.

Now we can train a model on a training set and evaluate its performance on a test set.

To randomly split the data, we can use the train_test_split() function from the sklearn.model_selection module.
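Here is a minimal sketch of how train_test_split() divides the data, using a tiny made-up list rather than the course dataset. With test_size=0.33, roughly a third of the rows (rounded up) go to the test set, and the rows are shuffled before splitting.

```python
from sklearn.model_selection import train_test_split

X = list(range(12))      # 12 toy "samples"
y = [i % 2 for i in X]   # toy labels

# Reserve about a third of the rows for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(len(X_train), len(X_test))  # 8 4
```

Note that the function returns four pieces in the order X_train, X_test, y_train, y_test, and the X/y rows stay paired after shuffling.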

Usually, for the test set, we use 25-40% of the data when the dataset is small, 10-30% for a medium-sized dataset, and less than 10% for large datasets.
In our example, there are only 342 instances, which is a small dataset, so we will use 33% of it as the test set.
Here is the syntax:

We name the training set X_train, y_train and the test set X_test, y_test.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X_train, y_train)  # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)  # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X_test, y_test))
print('1 Neighbor score:', knn1.score(X_test, y_test))

Notice that we now use the training set in .fit(X_train, y_train) and the test set in .score(X_test, y_test).
Since train_test_split() splits the dataset randomly, each time you press the Run Code button you get different train and test sets. Press it several times and you will see that the scores differ. These scores would become more stable if the dataset were larger.
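If you need the split to be identical on every run, train_test_split() accepts a random_state parameter that fixes the shuffling. A minimal sketch on toy data (made up for illustration):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [0, 1] * 5

# The same random_state produces the same shuffle, hence the same split
split_a = train_test_split(X, y, test_size=0.3, random_state=1)
split_b = train_test_split(X, y, test_size=0.3, random_state=1)
print(split_a[0] == split_b[0])  # True: identical training sets
```

Fixing random_state is useful for reproducible experiments, but keep in mind that the resulting score still reflects just one particular split.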

To achieve a 67%/33% train-test split, we take the first third of the rows as the test set and the remaining rows as the training set. Is this statement correct?

Select the correct answer


Section 4. Chapter 3