Learn ColumnTransformer for Mixed Data | Pipelines and Composition Patterns
Mastering scikit-learn API and Workflows

ColumnTransformer for Mixed Data

Real-world datasets often contain both numerical and categorical features. For instance, a customer dataset may include columns like age (numerical) and city (categorical). Different types of data require different preprocessing steps, such as scaling for numerical features and encoding for categorical ones, so applying one transformer to every column is not sufficient. This is where the ColumnTransformer becomes essential. It lets you specify exactly which transformation is applied to each subset of columns, and it concatenates the results into a single feature matrix, so heterogeneous data is prepared consistently for modeling.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample DataFrame with mixed data types
data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [50000, 64000, 120000, 98000],
    "city": ["New York", "San Francisco", "Chicago", "New York"]
})

# Define columns by data type
numeric_features = ["age", "income"]
categorical_features = ["city"]

# Create a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(), categorical_features)
    ]
)

# Fit and transform the data
transformed = preprocessor.fit_transform(data)
print(transformed)

The ColumnTransformer is designed to work seamlessly with the Pipeline object you learned about earlier. By placing your preprocessor as the first step in a pipeline, you can chain together preprocessing and modeling into a single, reusable workflow. This approach keeps your code organized and ensures that each transformation is applied consistently during both training and prediction. Integrating a ColumnTransformer within a pipeline makes it easy to handle mixed data types and maintain a clean, production-ready workflow.
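As a rough illustration of that pattern, the sketch below places the preprocessor as the first step of a `Pipeline` and a classifier as the second. The `target` labels, the `new_customers` frame, the choice of `LogisticRegression`, and the `handle_unknown="ignore"` setting are assumptions made for this example and are not part of the lesson's dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same sample data as in the lesson
data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [50000, 64000, 120000, 98000],
    "city": ["New York", "San Francisco", "Chicago", "New York"],
})
# Hypothetical binary target, invented purely for illustration
target = [0, 0, 1, 1]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        # Assumption: ignore cities not seen during fit instead of raising an error
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# Preprocessing first, model second: one object to fit and predict with
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])

# fit() learns the scaling statistics, the city categories, and the classifier
model.fit(data, target)

# Hypothetical new rows; predict() reuses the fitted preprocessing unchanged
new_customers = pd.DataFrame({
    "age": [29, 55],
    "income": [58000, 110000],
    "city": ["Chicago", "Boston"],
})
predictions = model.predict(new_customers)
print(predictions)
```

Because the scaler and encoder are fitted inside the pipeline, the statistics learned from the training rows are reused at prediction time rather than being recomputed on new data.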

What is the main purpose of using a ColumnTransformer in scikit-learn when working with real-world datasets that contain both numerical and categorical features?

Section 3. Chapter 2