ML Introduction with scikit-learn

What is Pipeline

In the previous section, we completed three preprocessing steps: imputing, encoding, and scaling.

We did this step by step, transforming the required columns and assembling them back into the X array. It is a tedious process, especially with a OneHotEncoder, which changes the number of columns.

Another problem is that new instances must go through the same preprocessing steps before a prediction can be made, so we would have to perform all of those transformations again.
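
For comparison, here is a minimal sketch of that manual, step-by-step approach. The column names and data are hypothetical and only illustrate the bookkeeping involved; sparse_output requires scikit-learn 1.2 or newer (older versions use sparse=False).

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric and one categorical column, both with gaps
X = pd.DataFrame({
    'height': [1.2, np.nan, 0.9, 1.5],
    'color': ['red', 'green', np.nan, 'red']
})

# Transform the categorical column: impute, then one-hot encode
cat = SimpleImputer(strategy='most_frequent').fit_transform(X[['color']])
cat = OneHotEncoder(sparse_output=False).fit_transform(cat)

# Transform the numeric column: impute, then scale
num = SimpleImputer(strategy='mean').fit_transform(X[['height']])
num = StandardScaler().fit_transform(num)

# Collect everything back into a single array by hand; OneHotEncoder
# changed the number of columns, so the alignment is manual
X_prepared = np.hstack([num, cat])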

Luckily, scikit-learn provides the Pipeline class – a simple way to chain all of those transformations together, so that both the training data and new instances can be transformed with a single object.

A Pipeline serves as a container for a sequence of transformers, optionally followed by a final estimator. When you call .fit_transform() on a Pipeline, it applies each transformer's .fit_transform() to the data in sequence.

# Import the pipeline and the transformers it chains together
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create a pipeline with three steps: imputation, one-hot encoding, and scaling
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Step 1: Impute missing values
    ('encoder', OneHotEncoder(sparse_output=False)),       # Step 2: Convert categorical data (dense output so it can be scaled)
    ('scaler', StandardScaler())                           # Step 3: Scale the data
])

# Fit and transform the training data using the pipeline
X_transformed = pipeline.fit_transform(X)

This streamlined approach means you only need to call .fit_transform() once on the training set and subsequently use the .transform() method to process new instances.
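
The same fitted pipeline can be reused on unseen data, and a pipeline can also end with an estimator so that .fit() and .predict() run the whole chain. The sketch below assumes a hypothetical X_new with the same columns as X, a target y, and a LogisticRegression model; none of these come from this chapter's code.

# Apply the already-fitted transformations to new, unseen instances
X_new_transformed = pipeline.transform(X_new)

# A pipeline may also end with an estimator; LogisticRegression is just an illustrative choice
from sklearn.linear_model import LogisticRegression

model_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

model_pipeline.fit(X, y)                # fits every transformer, then the model
y_pred = model_pipeline.predict(X_new)  # applies the same preprocessing, then predicts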

