Putting It All Together
In this challenge, you will apply everything you learned throughout the course. Here are the steps you need to take:

- Remove the rows that hold too little information;
- Encode the target `y`;
- Split the dataset into training and test sets;
- Build a pipeline with all the preprocessing steps and `GridSearchCV` as the final estimator to find the best hyperparameters;
- Train the model using the pipeline;
- Evaluate the model using the pipeline;
- Predict the target for `X_new` and decode it using the `LabelEncoder`'s `.inverse_transform()`.
Let's get to ~~work~~ code!
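As a quick warm-up, here is how `LabelEncoder` round-trips string labels. This is a minimal sketch with toy values chosen for illustration, not data loaded from the course dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration (not the course dataset)
species = ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap']

enc = LabelEncoder()
encoded = enc.fit_transform(species)      # classes are sorted: Adelie=0, Chinstrap=1, Gentoo=2
decoded = enc.inverse_transform(encoded)  # back to the original strings

print(list(encoded))   # [0, 2, 0, 1]
print(list(decoded))   # ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap']
```

The same `.inverse_transform()` call is what decodes the model's numeric predictions at the end of the challenge.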





Task

- Encode the target using `LabelEncoder`.
- Split the data so that 33% is used for the test set and the rest for the training set.
- Make a `ColumnTransformer` to encode only the `'island'` and `'sex'` columns, leaving the others untouched. Use a proper encoder for nominal data.
- Fill the gaps in `param_grid` to try the following values for the number of neighbors: `[1, 3, 5, 7, 9, 12, 15, 20, 25]`.
- Create a `GridSearchCV` object with `KNeighborsClassifier` as the model.
- Make a pipeline with `ct` as the first step and `grid_search` as the final estimator.
- Train the model using the pipeline on the training set.
- Evaluate the model on the test set (print its score).
- Get a predicted target for `X_test`.
- Print the best estimator found by `grid_search`.
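Before looking at the solution, the nominal-encoding step can be sketched on a toy frame. This is a minimal sketch with made-up values, assuming one nominal and one numeric column:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up values for illustration
df = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Biscoe'],
                   'flipper_length_mm': [181.0, 190.0, 195.0]})

# One-hot encode only 'island'; pass the numeric column through untouched
ct = make_column_transformer((OneHotEncoder(), ['island']),
                             remainder='passthrough')
out = ct.fit_transform(df)
print(out.shape)  # (3, 3): 2 one-hot columns + 1 passthrough column
```

`remainder='passthrough'` is what keeps the untransformed columns in the output; the default `remainder='drop'` would discard them.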
Solution
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# Remove rows with more than 1 null
df = df[df.isna().sum(axis=1) < 2]
# Assign X, y variables
X, y = df.drop('species', axis=1), df['species']
# Encode the target
label_enc = LabelEncoder()
y = label_enc.fit_transform(y)
# Make a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Create the ColumnTransformer for encoding features
ct = make_column_transformer((OneHotEncoder(), ['island', 'sex']),
                             remainder='passthrough')
# Make a param_grid for the grid search and initialize the GridSearchCV object
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 12, 15, 20, 25],
              'weights': ['distance', 'uniform'],
              'p': [1, 2]
              }
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid)
# Make a pipeline of ct, SimpleImputer, and StandardScaler
pipe = make_pipeline(ct,
                     SimpleImputer(strategy='most_frequent'),
                     StandardScaler(),
                     grid_search
                     )
# Train the model
pipe.fit(X_train, y_train)
# Print the score on the test set
print(pipe.score(X_test, y_test))
# Print predictions
y_pred = pipe.predict(X_test)  # Get encoded predictions
print(label_enc.inverse_transform(y_pred[:5]))  # Decode and print the first 5 predictions
# Print the best estimator
print(grid_search.best_estimator_)
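The grid-search step can also be tried in isolation to see what `best_params_` and `best_estimator_` expose after fitting. A minimal sketch, using the built-in iris dataset as a stand-in rather than the penguins CSV:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid)
grid_search.fit(X, y)

# After fitting, the best hyperparameters and the refitted model are available
print(grid_search.best_params_)
print(grid_search.best_estimator_)
```

When the `GridSearchCV` object is the final step of a pipeline, fitting the pipeline fits the search too, so these attributes become available on `grid_search` exactly as above.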
Section 4. Chapter 10
Starter code

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# Remove rows with more than 1 null
df = df[df.isna().sum(axis=1) < 2]
# Assign X, y variables
X, y = df.drop('species', axis=1), df['species']
# Encode the target
label_enc = LabelEncoder()
y = label_enc.___(y)
# Make a train-test split
X_train, X_test, y_train, y_test = ___(X, y, ___=0.33)
# Create the ColumnTransformer for encoding features
ct = make_column_transformer((___(), ['island', 'sex']),
                             remainder='___')
# Make a param_grid for the grid search and initialize the GridSearchCV object
param_grid = {'n_neighbors': ___,
              'weights': ['distance', 'uniform'],
              'p': [1, 2]
              }
grid_search = ___(KNeighborsClassifier(), ___)
# Make a pipeline of ct, SimpleImputer, and StandardScaler
pipe = make_pipeline(___,
                     SimpleImputer(strategy='most_frequent'),
                     StandardScaler(),
                     ___
                     )
# Train the model
pipe.___(___, ___)
# Print the score on the test set
print(pipe.___(X_test, ___))
# Print predictions
y_pred = pipe.___(X_test)  # Get encoded predictions
print(label_enc.inverse_transform(y_pred[:5]))  # Decode and print the first 5 predictions
# Print the best estimator
print(grid_search.___)