Preprocessing Objects in Practice
When working with real-world data, you often encounter missing values, inconsistent formats, and categorical variables that need to be converted to numbers before modeling. scikit-learn provides dedicated preprocessing objects, known as transformers, to handle these common data cleaning tasks. Two of the most widely used are SimpleImputer for filling in missing values and OneHotEncoder for converting categorical string features into numeric arrays. These preprocessing objects are designed to fit seamlessly into the scikit-learn workflow, following the familiar fit, transform, and fit_transform pattern you have already learned.
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9]
])

# Create a SimpleImputer to fill missing values with the mean of each column
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data)

print("Imputed data:")
print(imputed_data)
```
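The fit/transform split matters once new data arrives: the statistics learned during fit are reused, rather than recomputed, when transforming unseen rows. A minimal sketch with made-up training and test arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and a new, unseen row
train = np.array([
    [1.0, 10.0],
    [3.0, np.nan],
    [5.0, 30.0],
])
new_data = np.array([
    [np.nan, 40.0],
])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# The imputer stores the per-column means it learned from the training data
print(imputer.statistics_)          # [ 3. 20.]

# transform() fills missing values using those stored training means
print(imputer.transform(new_data))  # [[ 3. 40.]]
```

Calling `transform` on test data with the training-time statistics keeps preprocessing consistent between training and inference, which is exactly why transformers separate `fit` from `transform`.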
After handling missing values, you often need to encode categorical data so it can be used in machine learning models. OneHotEncoder is a transformer that converts categorical features into a one-hot numeric array, making them suitable for algorithms that require numerical input. Chaining these preprocessing steps — imputation followed by encoding — demonstrates a typical transformer workflow: each step prepares the data for the next, and each transformer follows the same API, making it easy to build robust data cleaning pipelines.
```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Example data: missing values and a categorical column
data = np.array([
    ["red", 1],
    [np.nan, 2],
    ["blue", 1],
    ["green", np.nan]
], dtype=object)

# First, impute missing values with the most frequent value in each column
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed_data = cat_imputer.fit_transform(data)

# Now, apply OneHotEncoder to the first column (the categorical feature)
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(imputed_data[:, [0]])

print("Imputed data:")
print(imputed_data)
print("One-hot encoded categorical feature:")
print(encoded_feature)
```
Preprocessing objects like SimpleImputer and OneHotEncoder are essential building blocks in a scikit-learn workflow. By following the transformer pattern, they allow you to chain multiple data cleaning steps in a consistent and reproducible way. This structure enables you to prepare your data efficiently before passing it to estimators for modeling, ensuring that your workflow is both robust and easy to maintain.