Automating Preprocessing with Pipelines

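Automating preprocessing and feature engineering with scikit-learn pipelines gives you consistent, reproducible machine learning results. A pipeline chains steps such as scaling, encoding, and feature selection so that every transformation always runs in the same order.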

To build a pipeline in scikit-learn, define the steps as a list of tuples, each pairing a unique step name (a string) with a transformer object such as StandardScaler or SelectKBest.

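from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

steps = [
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=2))
]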

Pass this list to the Pipeline constructor.

from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps)
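The step names are more than labels: they let you look up a step and change its parameters after the pipeline is built. The following is a minimal sketch of that idea, reusing the two steps from the text. The choice to change k to 1 is only an illustration, not something the lesson prescribes.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=2)),
])

# named_steps maps each step name to its transformer, in definition order
step_names = list(pipeline.named_steps)
print(step_names)

# Parameters of a step are addressed as "<step name>__<parameter>"
pipeline.set_params(feature_selection__k=1)  # illustrative value
selected_k = pipeline.named_steps["feature_selection"].k
print(selected_k)
```

The same "<step name>__<parameter>" naming is what hyperparameter search tools in scikit-learn use to tune individual steps of a pipeline.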

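The pipeline applies each transformer in order, passing the output of one step as the input to the next. This saves time, and because every transformer is fitted only on the data passed to fit, it also lowers the risk of data leakage and makes experiments more reliable and reproducible.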
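To make the chaining concrete, the sketch below compares a pipeline with the same two transformers applied by hand. It assumes a synthetic dataset from make_classification (not part of the lesson) so that it runs without downloading anything.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X, y = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Pipeline version: scale, then keep the 2 best-scoring features
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=2)),
])
X_pipe = pipeline.fit_transform(X, y)

# Manual version: the output of the scaler is fed to the selector
X_scaled = StandardScaler().fit_transform(X)
X_manual = SelectKBest(score_func=f_classif, k=2).fit_transform(X_scaled, y)

same_result = np.allclose(X_pipe, X_manual)
print(X_pipe.shape, same_result)
```

Both versions produce the same matrix. The pipeline simply removes the need to carry intermediate variables and to remember the order of the calls.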

Applying Transformations to Feature Subsets with ColumnTransformer

With ColumnTransformer, you can apply a different preprocessing pipeline to each subset of features in your data. For example:

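from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define column types
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

# Preprocessing for numeric features: impute missing values and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: impute missing values and encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both pipelines, routing each one to its own columns
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])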

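This gives you a single pipeline that handles both numeric and categorical data appropriately, keeps the preprocessing code organized, and ensures that each transformation is applied only to its intended columns.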
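A small sketch can show what such a combined preprocessor returns. The four-row DataFrame below is made up for illustration and only borrows the column names used in this lesson. Setting sparse_threshold=0 is an assumption made here so that the result is always a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample with one missing value in three of the columns
sample = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "embarked": ["S", "C", np.nan, "S"],
    "sex": ["male", "female", "female", "male"],
})

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output
preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, ["age", "fare"]),
        ("cat", categorical_transformer, ["embarked", "sex"]),
    ],
    sparse_threshold=0,
)

transformed = preprocessor.fit_transform(sample)
print(transformed.shape)
```

The four input columns become six output columns: two scaled numeric columns, two one-hot columns for embarked (the missing value is filled with the most frequent category, so only C and S occur), and two one-hot columns for sex. No missing values remain in the result.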

import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif

# Load the Titanic dataset through seaborn (downloaded on first use, then cached)
df = sns.load_dataset('titanic')

# Select features and target
features = ['age', 'fare', 'embarked', 'sex']
X = df[features]
y = df['survived']  # Target variable

# Define column types
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

# Preprocessing for numeric features: impute missing values and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: impute missing values and encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Build the full pipeline with preprocessing and feature selection
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=3))
])

# Fit and transform the data
X_transformed = pipeline.fit_transform(X, y)

print(f"Original shape: {X.shape}")
# One-hot encoding expands the 4 input columns before SelectKBest keeps the top 3
print(f"Shape after preprocessing and feature selection: {X_transformed.shape}")

Note

Integrating preprocessing into the training pipeline keeps transformations consistent and helps prevent data leakage during both training and prediction.
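The sketch below illustrates this point under a few assumptions that are not part of the lesson: a randomly generated DataFrame stands in for real data, the target is derived artificially from fare, and LogisticRegression is used as a placeholder estimator. It shows that when the whole pipeline is fitted on the training split, the imputation statistics come from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data that reuses the lesson's column names
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "embarked": rng.choice(["S", "C", "Q"], n),
    "sex": rng.choice(["male", "female"], n),
})
X.loc[::10, "age"] = np.nan  # inject missing ages
y = (X["fare"] > X["fare"].median()).astype(int)  # artificial target

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age", "fare"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["embarked", "sex"]),
])

# Placeholder estimator appended after preprocessing
model = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)          # every step is fitted on training rows only
predictions = model.predict(X_test)  # test rows are only transformed, never fitted

# The imputer's mean for "age" matches the training mean, not the full-data mean
numeric_pipeline = model.named_steps["preprocessing"].named_transformers_["num"]
learned_age_mean = numeric_pipeline.named_steps["imputer"].statistics_[0]
train_age_mean = X_train["age"].mean()
print(len(predictions), np.isclose(learned_age_mean, train_age_mean))
```

Because the test rows never influence the fitted imputer, scaler, or encoder, the evaluation on X_test is not contaminated by information from the held-out data.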

