What Is a Pipeline

In the previous section, we completed three preprocessing steps: imputing, encoding, and scaling.

We did it step by step, transforming the needed columns and collecting them back into the X array. This is tedious, especially when a OneHotEncoder changes the number of columns.

Another problem is that, to make a prediction, new instances must go through the same preprocessing steps, so we would need to repeat all those transformations.

Luckily, scikit-learn provides a Pipeline class: a simple way to chain all those transformations together, making it easier to transform both training data and new instances.

A Pipeline serves as a container for a sequence of transformers and, optionally, a final estimator. When you call the .fit_transform() method on a Pipeline, it calls .fit_transform() on each transformer in order, passing the output of one step as the input to the next.

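from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create a pipeline with three steps: imputation, one-hot encoding, and scaling
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Step 1: Impute missing values
    # Step 2: Convert categorical data. Dense output is requested because StandardScaler
    # cannot center a sparse matrix (sparse_output requires scikit-learn >= 1.2).
    ('encoder', OneHotEncoder(sparse_output=False)),
    ('scaler', StandardScaler())                           # Step 3: Scale the data
])

# Fit and transform the data using the pipeline
X_transformed = pipeline.fit_transform(X)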

This streamlined approach means you only need to call .fit_transform() once on the training set and then use the .transform() method to process new instances with the parameters learned during fitting.


