Final Estimator

Up to this point, we've used a Pipeline primarily for preprocessing. However, preprocessing is typically not the final goal. After preprocessing, the transformed data is usually fed into a predictor (model) to generate insights or make predictions.

This is why the Pipeline class is designed to include an estimator as its final step, which is often a predictor. The illustration below shows how a Pipeline functions when its last component is a predictor.

Why .transform()?

When a pipeline processes new instances for prediction, it calls .transform() on each preprocessing step rather than .fit_transform(). This way, new data is transformed with the parameters learned from the training set, so the training and test sets are processed consistently.
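To make this concrete, here is a minimal sketch that compares a pipeline with the equivalent manual steps. The toy arrays, the choice of StandardScaler as the only preprocessing step, and n_neighbors=3 are assumptions made for illustration; they are not part of the course dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature, two well-separated classes
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[2.5], [10.5]])

# Pipeline version: fit() fits the scaler and the model on the training data,
# predict() only transforms the new data before predicting
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

# Manual version of the same steps
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
manual_pred = knn.predict(scaler.transform(X_new))  # transform only on new data

print(np.array_equal(pipe_pred, manual_pred))
```

Both versions scale the new instances with the mean and standard deviation learned from the training data, which is why their predictions match.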

For example, let's consider a scenario involving a dataset with a single categorical feature, 'Color', that needs encoding before model training:

Here is what the one-hot encoded training data looks like:

Here are the new instances to predict:

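If we use .fit_transform() on these new instances, the OneHotEncoder is refitted on the new data alone, so it creates columns only for the categories present there. The resulting columns can differ in number and order from those of the training set, so the new instances would be encoded differently from the training data, and predictions would be unreliable.

However, using .transform() ensures that the new data is encoded with the same columns, in the same order, as the training data. Note that categories not seen during training are ignored (encoded as all zeros) only if the encoder was created with handle_unknown='ignore'; with the default setting, OneHotEncoder raises an error for unknown categories: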

Adding the Final Estimator

To use the final estimator, you just need to add it as the last step of the pipeline. For example, in the next chapter, we will use a KNeighborsClassifier model as a final estimator.

The syntax is as follows:

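from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ct, X, y, and X_new are assumed to be defined earlier
# Creating a pipeline
pipe = make_pipeline(
    ct,
    SimpleImputer(strategy='most_frequent'),
    StandardScaler(),
    KNeighborsClassifier(),
)
# Training a model using the pipeline
pipe.fit(X, y)
# Predicting new instances
pipe.predict(X_new)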
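For a version that runs on its own, the sketch below fills in the missing pieces with hypothetical data. The 'Color' and 'Size' columns, the make_column_transformer setup used for ct, and n_neighbors=3 are all assumptions made for illustration; the actual ct, X, and y come from the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data: one categorical and one numeric feature
X = pd.DataFrame({
    'Color': ['Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue'],
    'Size': [1.0, 2.0, np.nan, 10.0, 11.0, 12.0],
})
y = np.array([0, 0, 0, 1, 1, 1])
X_new = pd.DataFrame({'Color': ['Red', 'Blue'], 'Size': [1.5, 11.5]})

# Assumed stand-in for ct: one-hot encode 'Color', pass 'Size' through.
# sparse_threshold=0 forces a dense output, which StandardScaler needs
# when centering the data.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Color']),
    remainder='passthrough',
    sparse_threshold=0,
)

pipe = make_pipeline(
    ct,
    SimpleImputer(strategy='most_frequent'),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3),  # small k because the toy set has 6 rows
)

pipe.fit(X, y)                 # fits every step on the training data
y_pred = pipe.predict(X_new)   # transforms X_new with the fitted steps, then predicts
print(y_pred)
```

Calling pipe.predict() is all that is needed for new data: the pipeline applies each fitted preprocessing step in order before passing the result to the final estimator.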
