Automating Preprocessing with Pipelines

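Automating preprocessing and feature engineering with scikit-learn pipelines gives consistent, reproducible machine learning results. A pipeline chains steps such as scaling, encoding, and feature selection so that every transformation always runs in the same order.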

To build a pipeline in scikit-learn, define a list of tuples, where each tuple pairs a unique step name (a string) with a transformer object such as StandardScaler or SelectKBest. For example:

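from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

steps = [
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=2))
]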

Then pass this list to a Pipeline object:

pipeline = Pipeline(steps)

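The pipeline applies each transformer in order, passing the output of one step as the input to the next. Besides saving time, this reduces the risk of data leakage, because every transformation is fitted only on the data passed to fit, which makes experiments more reliable and reproducible.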
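To make the step-by-step behavior concrete, here is a minimal sketch that compares the pipeline's output with the same two transformers applied by hand. The synthetic data from make_classification and the variable names are assumptions for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 100 samples, 5 features
X, y = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Pipeline: scale first, then keep the 2 highest-scoring features
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=2)),
])
X_pipe = pipeline.fit_transform(X, y)

# The same two steps applied by hand, in the same order
X_scaled = StandardScaler().fit_transform(X)
X_manual = SelectKBest(score_func=f_classif, k=2).fit_transform(X_scaled, y)

# Both routes should produce the same array
same_result = np.allclose(X_pipe, X_manual)
print(X_pipe.shape, same_result)
```

The pipeline version is shorter and cannot accidentally run the steps in a different order, which is the main practical benefit over chaining the calls manually.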

Applying Transformers to Feature Subsets with ColumnTransformer

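With ColumnTransformer, you can apply a different preprocessing pipeline to each subset of features in your data. For example, you can define one pipeline for numeric columns and another for categorical columns: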

# Define column types
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

# Preprocessing for numeric features: impute missing values and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: impute missing values and encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

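Combining these transformers in a ColumnTransformer gives a single pipeline that handles both numeric and categorical data correctly, keeps the preprocessing code organized, and makes explicit which columns each transformation targets.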


import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif

# Load the Titanic example dataset via seaborn
df = sns.load_dataset('titanic')

# Select features and target
features = ['age', 'fare', 'embarked', 'sex']
X = df[features]
y = df['survived']  # Target variable

# Define column types
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

# Preprocessing for numeric features: impute missing values and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: impute missing values and encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Build the full pipeline with preprocessing and feature selection
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=3))
])

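# Fit and transform the data
X_transformed = pipeline.fit_transform(X, y)

# One-hot encoding expands the categorical columns, so feature selection
# sees more columns than the original input has
n_encoded = pipeline.named_steps['feature_selection'].n_features_in_
print(f"Original shape: {X.shape}")
print(f"Encoded to {n_encoded} features, then reduced to {X_transformed.shape[1]} selected features")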

Note

Integrating preprocessing into the training pipeline keeps transformations consistent and helps prevent data leakage during both training and prediction.
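The note above can be illustrated with a short sketch in which a model is the final pipeline step. The choice of LogisticRegression, the synthetic data, and the split sizes are assumptions made for this example; the lesson itself does not name a model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# fit() learns the scaler's mean and variance from the training rows only
model.fit(X_train, y_train)
rows_seen_by_scaler = model.named_steps["scaler"].n_samples_seen_

# predict() reuses those training statistics on the unseen test rows
test_predictions = model.predict(X_test)

# Cross-validation clones and refits the whole pipeline inside each fold,
# so the validation fold never influences the scaler
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(rows_seen_by_scaler, test_predictions.shape, len(cv_scores))
```

Scaling the full dataset before splitting would let test-set statistics influence training; keeping the scaler inside the pipeline avoids that without any extra bookkeeping.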
