Preprocessing Objects in Practice
When working with real-world data, you often encounter missing values, inconsistent formats, and categorical variables that need to be converted to numbers before modeling. scikit-learn provides dedicated preprocessing objects, known as transformers, to handle these common data cleaning tasks. Two of the most widely used are SimpleImputer for filling in missing values and OneHotEncoder for converting categorical string features into numeric arrays. These preprocessing objects are designed to fit seamlessly into the scikit-learn workflow, following the familiar fit, transform, and fit_transform pattern you have already learned.
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9]
])

# Create a SimpleImputer to fill missing values with the mean of each column
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data)

print("Imputed data:")
print(imputed_data)
```
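The fit/transform split matters once new data arrives: fit learns the column statistics, and transform reuses them without refitting. A minimal sketch with hypothetical training and new arrays (the variable names `train` and `new` are illustrative, not from the lesson):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: fit on one array, transform another
train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
new = np.array([[np.nan, 20.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)  # learns the mean of each column, ignoring NaNs

# statistics_ holds the learned column means: [3.0, 20.0]
print(imputer.statistics_)

# transform() fills new data with the means learned from train
print(imputer.transform(new))
```

Because transform reuses the fitted means, the same imputation is applied consistently to training and test data, avoiding leakage from the test set.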
After handling missing values, you often need to encode categorical data so it can be used in machine learning models. OneHotEncoder is a transformer that converts categorical features into a one-hot numeric array, making them suitable for algorithms that require numerical input. Chaining these preprocessing steps — imputation followed by encoding — demonstrates a typical transformer workflow: each step prepares the data for the next, and each transformer follows the same API, making it easy to build robust data cleaning pipelines.
```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Example data: missing values and a categorical column
data = np.array([
    ["red", 1],
    [np.nan, 2],
    ["blue", 1],
    ["green", np.nan]
], dtype=object)

# First, impute missing values in each column with its most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed_data = cat_imputer.fit_transform(data)

# Now, apply OneHotEncoder to the first column (the categorical feature)
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(imputed_data[:, [0]])

print("Imputed data:")
print(imputed_data)
print("One-hot encoded categorical feature:")
print(encoded_feature)
```
Preprocessing objects like SimpleImputer and OneHotEncoder are essential building blocks in a scikit-learn workflow. By following the transformer pattern, they allow you to chain multiple data cleaning steps in a consistent and reproducible way. This structure enables you to prepare your data efficiently before passing it to estimators for modeling, ensuring that your workflow is both robust and easy to maintain.