Composing Complex Workflows | Pipelines and Composition Patterns
Mastering scikit-learn API and Workflows

Composing Complex Workflows

When constructing advanced machine learning workflows, you often need to combine multiple preprocessing steps, feature engineering, and modeling in a flexible yet organized manner. scikit-learn enables this by allowing you to nest pipelines within each other and to use composition tools like FeatureUnion for parallel transformations. Nesting pipelines means you can create a pipeline as a step inside another pipeline, or use a ColumnTransformer as a step, further enhancing modularity. With FeatureUnion, you can apply several transformers to the same data in parallel and concatenate their outputs, which is especially useful for combining different feature extraction or engineering strategies.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Example data
data = pd.DataFrame({
    "age": [25, 32, 47, 51, None],
    "income": [50000, 60000, 80000, None, 52000],
    "gender": ["M", "F", "F", "M", "F"]
})

# Numeric and categorical feature lists
numeric_features = ["age", "income"]
categorical_features = ["gender"]

# Numeric pipeline: impute missing values, then standardize
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

# Categorical pipeline: impute, then one-hot encode
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# ColumnTransformer as a modular step
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])

# Final workflow pipeline: nest the ColumnTransformer as a step
workflow = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression())
])

# Fit the pipeline
workflow.fit(data, [0, 1, 0, 1, 0])
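The example above nests pipelines inside a ColumnTransformer, which routes different columns to different transformers. FeatureUnion composes the same way but applies every transformer to the same input in parallel and concatenates their outputs column-wise. The following is a minimal sketch of that pattern; the iris dataset and the PCA/SelectKBest combination are illustrative choices, not part of the example above:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = load_iris(return_X_y=True)

# Two transformers run on the same input in parallel;
# their outputs are concatenated column-wise
combined_features = FeatureUnion([
    ("pca", PCA(n_components=2)),        # 2 principal components
    ("select_best", SelectKBest(k=1))    # 1 best feature by ANOVA F-test
])

model = Pipeline([
    ("features", combined_features),     # parallel feature engineering
    ("classifier", LogisticRegression(max_iter=1000))
])

model.fit(X, y)
print(model.named_steps["features"].transform(X).shape)  # (150, 3)

Because FeatureUnion is itself an estimator, the "features" step can be swapped, tuned, or nested inside a larger pipeline just like any other step.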

Modular composition using pipelines, ColumnTransformer, and parallel transformers like FeatureUnion brings several benefits to your workflow. By encapsulating each step, you make your code more maintainable—each part can be adjusted or replaced independently without rewriting the entire workflow. This modularity also promotes code reuse, as you can share or reuse pipelines across different projects or experiments. Furthermore, such structure ensures reproducibility: every transformation and estimator is defined within the pipeline, so running the same pipeline on the same data always yields the same result, provided random states are controlled. This makes it easier to track, debug, and share your experiments with others.
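Because every step is addressable by name, you can replace or tune parts of the workflow without rewriting it. The sketch below reuses the workflow pipeline defined above; the RandomForestClassifier swap and the grid values are illustrative assumptions, and cv=2 is only used because the toy dataset has five rows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Swap the classifier step in place; the preprocessing stays untouched.
# Fixing random_state keeps repeated runs reproducible.
workflow.set_params(classifier=RandomForestClassifier(random_state=42))

# Nested parameters are reached with "step__substep__param" paths
param_grid = {
    "preprocessing__num__imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
}

search = GridSearchCV(workflow, param_grid, cv=2)
search.fit(data, [0, 1, 0, 1, 0])
print(search.best_params_)

This double-underscore naming scheme is what makes nested composition practical: any parameter of any step, however deeply nested, can be reached for tuning or inspection.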

