Preprocessing Objects in Practice | Transformers and Preprocessing Workflows
Mastering scikit-learn API and Workflows

Preprocessing Objects in Practice

When working with real-world data, you often encounter missing values, inconsistent formats, and categorical variables that need to be converted to numbers before modeling. scikit-learn provides dedicated preprocessing objects, known as transformers, to handle these common data cleaning tasks. Two of the most widely used are SimpleImputer for filling in missing values and OneHotEncoder for converting categorical string features into numeric arrays. These preprocessing objects are designed to fit seamlessly into the scikit-learn workflow, following the familiar fit, transform, and fit_transform pattern you have already learned.

import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9]
])

# Create a SimpleImputer to fill missing values with the mean of each column
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data)

print("Imputed data:")
print(imputed_data)
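
To make the two halves of that pattern explicit, here is a minimal sketch (the train and new arrays are illustrative, not from the lesson) that fits the imputer on one dataset and reuses its learned column means on another:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and unseen data, used only to illustrate fit vs. transform
train = np.array([
    [1.0, 10.0],
    [3.0, np.nan],
    [5.0, 30.0]
])
new = np.array([
    [np.nan, 20.0]
])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)             # learns per-column means, skipping NaNs: [3.0, 20.0]
print(imputer.statistics_)     # the stored fill values
print(imputer.transform(new))  # fills the NaN with the learned mean -> [[3.0, 20.0]]

Because the fill values are learned once during fit, the same imputer applies identical statistics to any later data, which is exactly what you want when preparing a held-out test set.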

After handling missing values, you often need to encode categorical data so it can be used in machine learning models. OneHotEncoder is a transformer that converts categorical features into a one-hot numeric array, making them suitable for algorithms that require numerical input. Chaining these preprocessing steps — imputation followed by encoding — demonstrates a typical transformer workflow: each step prepares the data for the next, and each transformer follows the same API, making it easy to build robust data cleaning pipelines.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Example data: missing values and categorical column
data = np.array([
    ["red", 1],
    [np.nan, 2],
    ["blue", 1],
    ["green", np.nan]
], dtype=object)

# First, impute missing values in the categorical column with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed_data = cat_imputer.fit_transform(data)

# Now, apply OneHotEncoder to the first column (categorical feature)
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(imputed_data[:, [0]])

print("Imputed data:")
print(imputed_data)
print("One-hot encoded categorical feature:")
print(encoded_feature)
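
One practical detail the encoder's defaults hide: a fitted OneHotEncoder remembers the categories it saw during fit and raises an error if it later meets a new one. The sketch below (illustrative data, not part of the lesson code) uses the handle_unknown="ignore" option, which instead encodes unseen categories as an all-zeros row:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colors = np.array([["red"], ["blue"], ["green"]], dtype=object)
new_colors = np.array([["purple"]], dtype=object)  # category never seen during fit

# handle_unknown="ignore" maps unseen categories to all zeros instead of raising
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoder.fit(train_colors)

print(encoder.categories_)            # categories learned from the training data
print(encoder.transform(new_colors))  # -> [[0. 0. 0.]]

Keeping the output width fixed this way matters once the encoded array feeds a downstream estimator that expects a constant number of features.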

Preprocessing objects like SimpleImputer and OneHotEncoder are essential building blocks in a scikit-learn workflow. By following the transformer pattern, they allow you to chain multiple data cleaning steps in a consistent and reproducible way. This structure enables you to prepare your data efficiently before passing it to estimators for modeling, ensuring that your workflow is both robust and easy to maintain.
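
To make that chaining concrete, scikit-learn's Pipeline can wrap both transformers into a single object with the same fit and transform interface. The sketch below (the step names "impute" and "encode" are arbitrary labels, and the data is illustrative) applies the chain to just the categorical column:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The categorical column on its own, kept 2D because transformers expect 2D input
colors = np.array([["red"], [np.nan], ["blue"], ["green"]], dtype=object)

# Each step's output becomes the next step's input
clean_encode = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(sparse_output=False)),
])

print(clean_encode.fit_transform(colors))

Because the pipeline is itself a transformer, it can be fitted on training data and reused on new data exactly like the individual objects above.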

Why are preprocessing objects like SimpleImputer and OneHotEncoder important in scikit-learn workflows?

