Evaluating the Model | Modeling
ML Introduction with scikit-learn

Evaluating the Model

When building a model for predictions, it is essential to understand how well the model performs before making any actual predictions.

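Evaluating a model means measuring how good its predictions are. This is what the .score() method is for: for a classifier, it returns the accuracy, that is, the fraction of instances whose predicted label matches the true label.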
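As a minimal sketch of what .score() reports for a classifier, the snippet below compares it with an accuracy computed by hand from the predictions. The data comes from make_classification, a synthetic stand-in assumed here for illustration rather than the course dataset, and the model is scored on the data it was fitted on only to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the course dataset)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

model = KNeighborsClassifier().fit(X, y)

# For a classifier, .score() returns the mean accuracy of its predictions
score_from_method = model.score(X, y)
# The same number computed by hand from the predicted labels
score_by_hand = accuracy_score(y, model.predict(X))

print(score_from_method, score_by_hand)
```

Both printed values should be identical, since .score() on a classifier is a shortcut for predicting and then computing accuracy.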

However, evaluating the model using the training set data can yield unreliable results because a model is likely to perform better on data it was trained on than on new, unseen data. Therefore, it is crucial to evaluate the model on data it has never seen before to truly understand its performance.

In more formal terms, we want a model that generalizes well.

We can measure how well a model generalizes by randomly splitting the data into a training set and a test set.
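To make the idea of a random split concrete, here is a small sketch that does it by hand with NumPy: the row indices are shuffled, and the first third of the shuffled indices becomes the test set. The toy arrays, the seed, and the one-third proportion are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Toy data (an assumption for illustration): 12 rows, 2 features
n_samples = 12
X = np.arange(n_samples * 2).reshape(n_samples, 2)
y = np.arange(n_samples) % 3

# Shuffle the row indices, then cut the shuffled order into two parts
shuffled = rng.permutation(n_samples)
n_test = n_samples // 3
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))
```

Because the indices are shuffled before slicing, the test rows are drawn from all over the dataset instead of from one contiguous block, and no row ends up in both sets.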

Now we can train the model on the training set and evaluate its performance on the test set.

python
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

To randomly split the data, we can use the train_test_split() function from the sklearn.model_selection module.

Typically, for a test set, we use 25-40% of the data when the dataset is small, 10-30% for a medium-sized dataset, and less than 10% for large datasets.

Our example has only 342 instances, which makes it a small dataset, so we will allocate 33% of the data to the test set.

python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We refer to the training set as X_train and y_train, and the test set as X_test and y_test.
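As a quick check of what test_size=0.33 means for 342 instances, the sketch below splits placeholder arrays of that length and prints the resulting set sizes. The arrays are dummy data assumed for illustration; only their length matches the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 342 rows (dummy values, assumed for illustration)
X = np.arange(342 * 2).reshape(342, 2)
y = np.arange(342) % 3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# 0.33 * 342 = 112.86; scikit-learn rounds the test count up to 113
print(len(X_train), len(X_test))  # 229 113
```

The row counts do not depend on the random shuffle, only which rows land in each set does.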

python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X_train, y_train) # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train) # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X_test, y_test))
print('1 Neighbor score:', knn1.score(X_test, y_test))

Notice that we now pass the training set to .fit(X_train, y_train) and the test set to .score(X_test, y_test).
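To see why scoring on the training set is misleading, the following sketch compares both scores for a 1-nearest-neighbor model. It uses synthetic data from make_classification as a stand-in (an assumption, not the penguins dataset) and a fixed random_state so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(
    n_samples=342, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Each training point is its own nearest neighbor, so the model simply
# recalls the label it memorized during training
train_score = knn1.score(X_train, y_train)
# Only this score reflects performance on data the model has not seen
test_score = knn1.score(X_test, y_test)

print('Training set score:', train_score)
print('Test set score:', test_score)
```

The training score here is a perfect 1.0 regardless of how well the model generalizes, which is exactly why it says nothing useful about performance on new data.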

Since train_test_split() splits the dataset randomly, the training and test sets are different each time you run the code. You can run it several times and see that the scores differ. These scores would become more stable if the dataset were larger.
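The run-to-run variation can be observed directly by repeating the split with different seeds, as in the sketch below. It relies on the random_state parameter of train_test_split() to control the shuffle, and on synthetic stand-in data from make_classification (an assumption, not the penguins dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(
    n_samples=342, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Repeat the split with ten different seeds and collect the test scores
scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed
    )
    model = KNeighborsClassifier().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print('Scores:', np.round(scores, 3))
print('Mean:', np.mean(scores), 'Std:', np.std(scores))

# Passing the same random_state twice reproduces exactly the same split
first = train_test_split(X, y, test_size=0.33, random_state=42)
second = train_test_split(X, y, test_size=0.33, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print('Same split with the same random_state:', same_split)
```

Fixing random_state is handy when you want a result you can reproduce, while the spread of scores across seeds gives a rough feel for how much a single split can mislead.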


To achieve a 67%/33% train-test split, we take the first third of the rows as the test set and the remaining rows as the training set. Is this statement correct?
