Composing Complex Workflows | Pipelines and Composition Patterns
Mastering scikit-learn API and Workflows

Composing Complex Workflows

When constructing advanced machine learning workflows, you often need to combine multiple preprocessing steps, feature engineering, and modeling in a flexible yet organized manner. scikit-learn enables this by allowing you to nest pipelines within each other and to use composition tools like FeatureUnion for parallel transformations. Nesting pipelines means you can create a pipeline as a step inside another pipeline, or use a ColumnTransformer as a step, further enhancing modularity. With FeatureUnion, you can apply several transformers to the same data in parallel and concatenate their outputs, which is especially useful for combining different feature extraction or engineering strategies.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Example data
data = pd.DataFrame({
    "age": [25, 32, 47, 51, None],
    "income": [50000, 60000, 80000, None, 52000],
    "gender": ["M", "F", "F", "M", "F"]
})

# Numeric and categorical feature lists
numeric_features = ["age", "income"]
categorical_features = ["gender"]

# Numeric pipeline: impute missing values, then standardize
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

# Categorical pipeline: impute, then one-hot encode
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# ColumnTransformer as a modular step
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])

# Final workflow pipeline: nest the ColumnTransformer as a step
workflow = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression())
])

# Fit the pipeline
workflow.fit(data, [0, 1, 0, 1, 0])
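The example above nests pipelines inside a ColumnTransformer, which routes different columns to different transformers. FeatureUnion composes the same way but applies every transformer to the same input in parallel and concatenates their outputs column-wise. The following is a minimal sketch of that pattern; the iris dataset and the PCA/SelectKBest combination are illustrative choices, not part of the example above:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = load_iris(return_X_y=True)

# Two transformers run on the same input in parallel;
# their outputs are concatenated column-wise
combined_features = FeatureUnion([
    ("pca", PCA(n_components=2)),        # 2 principal components
    ("select_best", SelectKBest(k=1))    # 1 best feature by ANOVA F-test
])

model = Pipeline([
    ("features", combined_features),     # parallel feature engineering
    ("classifier", LogisticRegression(max_iter=1000))
])

model.fit(X, y)
print(model.named_steps["features"].transform(X).shape)  # (150, 3)

Because FeatureUnion is itself an estimator, the "features" step can be swapped, tuned, or nested inside a larger pipeline just like any other step.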

Modular composition using pipelines, ColumnTransformer, and parallel transformers like FeatureUnion brings several benefits to your workflow. By encapsulating each step, you make your code more maintainable—each part can be adjusted or replaced independently without rewriting the entire workflow. This modularity also promotes code reuse, as you can share or reuse pipelines across different projects or experiments. Furthermore, such structure ensures reproducibility: every transformation and estimator is defined within the pipeline, so running the same pipeline on the same data always yields the same result, provided random states are controlled. This makes it easier to track, debug, and share your experiments with others.
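Because every step is addressable by name, you can replace or tune parts of the workflow without rewriting it. The sketch below reuses the workflow pipeline defined above; the RandomForestClassifier swap and the grid values are illustrative assumptions, and cv=2 is only used because the toy dataset has five rows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Swap the classifier step in place; the preprocessing stays untouched.
# Fixing random_state keeps repeated runs reproducible.
workflow.set_params(classifier=RandomForestClassifier(random_state=42))

# Nested parameters are reached with "step__substep__param" paths
param_grid = {
    "preprocessing__num__imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
}

search = GridSearchCV(workflow, param_grid, cv=2)
search.fit(data, [0, 1, 0, 1, 0])
print(search.best_params_)

This double-underscore naming scheme is what makes nested composition practical: any parameter of any step, however deeply nested, can be reached for tuning or inspection.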

