Final Estimator
So far, Pipeline has been used only for preprocessing, but its main purpose is to chain preprocessing with a final predictor. The last step of a pipeline can be any estimator (typically a model) that produces predictions.
When .fit() is called, each transformer runs .fit_transform() in order, and the final estimator is then fitted on the transformed data.
When .predict() is called, each transformer runs .transform() on the data before it is passed to the final estimator.
This is required because new data must be transformed exactly like the training data.
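The fit/predict behavior described above can be sketched by comparing a pipeline with the equivalent manual steps. This is a minimal illustration on invented toy data (the arrays and n_neighbors=1 are assumptions chosen for the example, not part of the lesson's dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for illustration
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[1.5, 15.0], [3.5, 35.0]])

# Pipeline: fit() runs fit_transform() on the scaler, then fit() on the model;
# predict() runs transform() on the scaler, then predict() on the model
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

# The same steps written out by hand
scaler = StandardScaler()
model = KNeighborsClassifier(n_neighbors=1)
model.fit(scaler.fit_transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_new))

print(pipe_pred, manual_pred)  # both approaches give the same predictions
```

Note that the scaler is fitted only once, on the training data; the new data is scaled with the training mean and standard deviation.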
Why .transform()?
Using .fit_transform() on new data could change encodings (e.g., in OneHotEncoder), creating mismatched columns and unreliable predictions.
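.transform() reuses the categories and column order learned during training, so preprocessing stays consistent. For OneHotEncoder, unseen categories are ignored only if the encoder was created with handle_unknown='ignore'; by default they raise an error.
Here is what the one-hot encoded training data looks like: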
Here are the new instances to predict:
If .fit_transform() were applied to new instances, the OneHotEncoder could generate columns in a different order or even introduce new ones. This would cause the new data to be transformed inconsistently with the training set, making predictions unreliable.
However, using .transform() ensures that the new data is encoded exactly as the training data, ignoring categories not seen during training (provided the encoder uses handle_unknown='ignore'):
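The difference can be sketched with a small, made-up color column (the values below are invented for illustration and are not the lesson's dataset). The encoder is created with handle_unknown='ignore' so that unseen categories become all-zero rows instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented training values and new instances (note the unseen "purple")
train = np.array([["red"], ["green"], ["blue"]])
new = np.array([["green"], ["purple"]])

# Fit on the training data only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
train_cols = enc.categories_[0].tolist()

# transform(): same three columns as in training; "purple" becomes all zeros
new_encoded = enc.transform(new).toarray()

# fit_transform() on the new data: a fresh encoder learns different columns
refit = OneHotEncoder(handle_unknown="ignore")
refit_encoded = refit.fit_transform(new).toarray()
refit_cols = refit.categories_[0].tolist()

print(train_cols, new_encoded.shape)    # columns learned from training data
print(refit_cols, refit_encoded.shape)  # columns no longer match training
```

The refitted encoder produces two columns (green, purple) instead of the three the model was trained on, which is exactly the mismatch that .transform() avoids.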
Adding the Final Estimator
Simply add the model as the last step of the pipeline:
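from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ct is the preprocessing transformer defined earlier; X, y are the training data
pipe = make_pipeline(
    ct,                                       # encode the columns
    SimpleImputer(strategy='most_frequent'),  # fill missing values
    StandardScaler(),                         # scale the features
    KNeighborsClassifier()                    # final estimator
)
pipe.fit(X, y)        # fit every transformer, then the model
pipe.predict(X_new)   # transform X_new, then predict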
This allows the whole workflow (preprocessing + prediction) to run with one call.
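For a version that runs on its own, here is a minimal sketch with a hypothetical stand-in for ct and a tiny invented DataFrame. The column names, values, and the make_column_transformer setup are assumptions for illustration, not the course's actual data or transformer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented training data with one categorical and one numeric column
X = pd.DataFrame({
    "island": ["A", "B", "A", "B", "A", "B"],
    "mass": [3.0, 6.0, 3.2, np.nan, 2.9, 6.1],
})
y = ["small", "large", "small", "large", "small", "large"]

# New instances, including a category ("C") never seen in training
X_new = pd.DataFrame({"island": ["A", "C"], "mass": [3.1, 6.0]})

# Hypothetical stand-in for ct: one-hot encode "island", pass "mass" through;
# sparse_threshold=0 forces dense output so StandardScaler can center it
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["island"]),
    remainder="passthrough",
    sparse_threshold=0,
)

pipe = make_pipeline(
    ct,
    SimpleImputer(strategy="most_frequent"),
    StandardScaler(),
    KNeighborsClassifier(),
)
pipe.fit(X, y)
preds = pipe.predict(X_new)
print(preds)  # one predicted label per row of X_new
```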