Evaluating a Model. Train-Test split. | Modeling
ML Introduction with scikit-learn
Evaluating a Model. Train-Test split.

When we build a model for predictions, it is essential to understand how well it performs before actually using it to predict anything.
Evaluating a model is the process of assessing the quality of its predictions; this is what the .score() method is for.
However, evaluating the model on the training set gives unreliable results, since a model is likely to perform better on the data it was trained on than on data it has never seen.
So it is crucial to evaluate the model on unseen data to understand how well it will generalize.
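To see why the training-set score is misleading, here is a minimal sketch on a synthetic dataset (made up for illustration, not the course data): a 1-nearest-neighbor classifier memorizes its training points, so its score on the training set is a perfect 1.0 no matter how well it generalizes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 200 random 2-D points labeled by which side of a line they fall on
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1-NN memorizes the training set: each point's nearest neighbor is itself
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print('Score on training data:', knn1.score(X, y))  # 1.0
```

A perfect training score like this tells us nothing about performance on new data, which is exactly why we hold out a test set.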

We can do this by randomly splitting the data into a training set and a test set.

Now we can train a model on a training set and evaluate its performance on a test set.

To randomly split the data, we can use the train_test_split() function from the sklearn.model_selection module.
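Here is a minimal sketch of how train_test_split() divides the data, using a tiny made-up list rather than the course dataset. With test_size=0.33, roughly a third of the rows (rounded up) go to the test set, and the rows are shuffled before splitting.

```python
from sklearn.model_selection import train_test_split

X = list(range(12))      # 12 toy "samples"
y = [i % 2 for i in X]   # toy labels

# Reserve about a third of the rows for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(len(X_train), len(X_test))  # 8 4
```

Note that the function returns four pieces in the order X_train, X_test, y_train, y_test, and the X/y rows stay paired after shuffling.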

Usually, for the test set, we use 25-40% of the data when the dataset is small, 10-30% for a medium-sized dataset, and less than 10% for large datasets.
In our example, there are only 342 instances, which is a small dataset, so we will use 33% of it as the test set.
Here is the syntax:

We name the training set X_train, y_train and the test set X_test, y_test.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X_train, y_train)  # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)  # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X_test, y_test))
print('1 Neighbor score:', knn1.score(X_test, y_test))

Notice that we now use the training set in .fit(X_train, y_train) and the test set in .score(X_test, y_test).
Since train_test_split() splits the dataset randomly, each time you press the Run Code button you get different train and test sets. Press it several times and you will see that the scores differ. These scores would become more stable if the dataset were larger.
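If you need the split to be identical on every run, train_test_split() accepts a random_state parameter that fixes the shuffling. A minimal sketch on toy data (made up for illustration):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [0, 1] * 5

# The same random_state produces the same shuffle, hence the same split
split_a = train_test_split(X, y, test_size=0.3, random_state=1)
split_b = train_test_split(X, y, test_size=0.3, random_state=1)
print(split_a[0] == split_b[0])  # True: identical training sets
```

Fixing random_state is useful for reproducible experiments, but keep in mind that the resulting score still reflects just one particular split.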

To achieve a 67%/33% train-test split, we take the first third of the rows as the test set and the remaining rows as the training set. Is this statement correct?

Select the correct answer


Section 4. Chapter 3