Preprocessing Objects in Practice | Transformers and Preprocessing Workflows
Mastering scikit-learn API and Workflows

Preprocessing Objects in Practice

When working with real-world data, you often encounter missing values, inconsistent formats, and categorical variables that need to be converted to numbers before modeling. scikit-learn provides dedicated preprocessing objects, known as transformers, to handle these common data cleaning tasks. Two of the most widely used are SimpleImputer for filling in missing values and OneHotEncoder for converting categorical string features into numeric arrays. These preprocessing objects are designed to fit seamlessly into the scikit-learn workflow, following the familiar fit, transform, and fit_transform pattern you have already learned.

import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9]
])

# Create a SimpleImputer to fill missing values with the mean of each column
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data)

print("Imputed data:")
print(imputed_data)
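
To make the two halves of that pattern explicit, here is a minimal sketch (the train and new arrays are illustrative, not from the lesson) that fits the imputer on one dataset and reuses its learned column means on another:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and unseen data, used only to illustrate fit vs. transform
train = np.array([
    [1.0, 10.0],
    [3.0, np.nan],
    [5.0, 30.0]
])
new = np.array([
    [np.nan, 20.0]
])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)             # learns per-column means, skipping NaNs: [3.0, 20.0]
print(imputer.statistics_)     # the stored fill values
print(imputer.transform(new))  # fills the NaN with the learned mean -> [[3.0, 20.0]]

Because the fill values are learned once during fit, the same imputer applies identical statistics to any later data, which is exactly what you want when preparing a held-out test set.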

After handling missing values, you often need to encode categorical data so it can be used in machine learning models. OneHotEncoder is a transformer that converts categorical features into a one-hot numeric array, making them suitable for algorithms that require numerical input. Chaining these preprocessing steps — imputation followed by encoding — demonstrates a typical transformer workflow: each step prepares the data for the next, and each transformer follows the same API, making it easy to build robust data cleaning pipelines.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Example data: missing values and categorical column
data = np.array([
    ["red", 1],
    [np.nan, 2],
    ["blue", 1],
    ["green", np.nan]
], dtype=object)

# First, impute missing values in the categorical column with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed_data = cat_imputer.fit_transform(data)

# Now, apply OneHotEncoder to the first column (categorical feature)
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(imputed_data[:, [0]])

print("Imputed data:")
print(imputed_data)
print("One-hot encoded categorical feature:")
print(encoded_feature)
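
One practical detail the encoder's defaults hide: a fitted OneHotEncoder remembers the categories it saw during fit and raises an error if it later meets a new one. The sketch below (illustrative data, not part of the lesson code) uses the handle_unknown="ignore" option, which instead encodes unseen categories as an all-zeros row:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colors = np.array([["red"], ["blue"], ["green"]], dtype=object)
new_colors = np.array([["purple"]], dtype=object)  # category never seen during fit

# handle_unknown="ignore" maps unseen categories to all zeros instead of raising
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoder.fit(train_colors)

print(encoder.categories_)            # categories learned from the training data
print(encoder.transform(new_colors))  # -> [[0. 0. 0.]]

Keeping the output width fixed this way matters once the encoded array feeds a downstream estimator that expects a constant number of features.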

Preprocessing objects like SimpleImputer and OneHotEncoder are essential building blocks in a scikit-learn workflow. By following the transformer pattern, they allow you to chain multiple data cleaning steps in a consistent and reproducible way. This structure enables you to prepare your data efficiently before passing it to estimators for modeling, ensuring that your workflow is both robust and easy to maintain.
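
To make that chaining concrete, scikit-learn's Pipeline can wrap both transformers into a single object with the same fit and transform interface. The sketch below (the step names "impute" and "encode" are arbitrary labels, and the data is illustrative) applies the chain to just the categorical column:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The categorical column on its own, kept 2D because transformers expect 2D input
colors = np.array([["red"], [np.nan], ["blue"], ["green"]], dtype=object)

# Each step's output becomes the next step's input
clean_encode = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(sparse_output=False)),
])

print(clean_encode.fit_transform(colors))

Because the pipeline is itself a transformer, it can be fitted on training data and reused on new data exactly like the individual objects above.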

Why are preprocessing objects like SimpleImputer and OneHotEncoder important in scikit-learn workflows?

