Learn Oversampling Techniques | Sampling Techniques for Large Data

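Oversampling is a technique for handling imbalanced datasets. It is used especially when one class (the minority class) has far fewer samples than the others. Increasing the representation of the minority class helps a machine learning model learn more effectively from every class, which often improves predictive performance and leads to fairer results. The most common benefit of oversampling is that balancing the class distribution keeps the algorithm from becoming biased toward the majority class. Oversampling has drawbacks, however. Simply duplicating existing samples can cause overfitting: the model becomes too specialized to the duplicated data and generalizes worse to new data. Oversampling also enlarges the dataset, which can lengthen training time and increase computational cost.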


import pandas as pd

# Create a sample DataFrame with an imbalanced target
data = {
    "feature1": [1, 2, 3, 4, 5, 6, 7],
    "target":   ["A", "A", "A", "A", "B", "B", "B"]
}
df = pd.DataFrame(data)

# Count original class distribution
print("Original class distribution:")
print(df["target"].value_counts())

# Oversample minority class "B" to match majority class "A"
majority_count = df["target"].value_counts().max()
minority_class = df["target"].value_counts().idxmin()

# Get all minority class rows
minority_rows = df[df["target"] == minority_class]

# Calculate how many samples to add
samples_to_add = majority_count - len(minority_rows)

# Sample with replacement from minority class
oversampled_minority = minority_rows.sample(n=samples_to_add, replace=True, random_state=42)

# Concatenate original data with new samples
df_oversampled = pd.concat([df, oversampled_minority], ignore_index=True)

# Show new class distribution
print("\nClass distribution after oversampling:")
print(df_oversampled["target"].value_counts())
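
The same steps can be wrapped in a small helper that balances more than two classes. This is only a minimal sketch and not part of the original lesson: the function name `random_oversample`, its parameters, and the three-class sample data are hypothetical, chosen for illustration. It also counts the duplicated rows, which makes the overfitting risk described above visible: every added row is an exact copy of an existing one.

```python
import pandas as pd


def random_oversample(df, target_col, random_state=42):
    """Hypothetical helper: duplicate rows of each smaller class
    until every class is as large as the largest one."""
    counts = df[target_col].value_counts()
    majority_count = counts.max()

    parts = [df]
    for cls, count in counts.items():
        deficit = majority_count - count
        if deficit > 0:
            # Sample with replacement from this class only
            extra = df[df[target_col] == cls].sample(
                n=deficit, replace=True, random_state=random_state
            )
            parts.append(extra)

    # Original rows first, then the resampled copies
    return pd.concat(parts, ignore_index=True)


# Illustrative data with three classes of different sizes
df_multi = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6, 7],
    "target":   ["A", "A", "A", "A", "B", "B", "C"],
})

balanced = random_oversample(df_multi, "target")
print(balanced["target"].value_counts())

# Rows that are exact copies of an earlier row
print("Duplicated rows:", int(balanced.duplicated().sum()))
```

Because the helper only copies existing rows, it adds no new information about the minority classes. Applying it to the training split alone, after the train/test split, keeps copies of the same row from ending up in both sets.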
