
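When working with real-world datasets, you will frequently encounter duplicate records and outliers. Both can significantly affect data analysis and the performance of machine learning models. Duplicates artificially inflate the importance of certain patterns and bias the results, while outliers can distort statistical summaries and model predictions. Identifying and handling these issues properly is a key part of data cleaning.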


```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Find duplicate rows in the Titanic dataset
duplicates = df.duplicated()
print("Duplicate row indicators:")
print(duplicates.value_counts())  # Show how many duplicates exist

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nNumber of rows before removing duplicates:")
print(len(df))
print("Number of rows after removing duplicates:")
print(len(df_no_duplicates))
```
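
By default, `duplicated()` and `drop_duplicates()` compare every column and keep the first occurrence. The sketch below shows how the `subset` and `keep` parameters change what counts as a duplicate. The `passengers` table and its values are made up for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
passengers = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cara", "Bob"],
    "ticket": ["A1", "B2", "A1", "C3", "B9"],
    "fare": [7.25, 8.05, 7.25, 30.0, 8.05],
})

# Default: a row is a duplicate only if every column matches an earlier row
full_dupes = passengers.duplicated()
print("Full-row duplicates:", full_dupes.sum())

# keep=False flags every member of a duplicate group, not just the repeats
all_copies = passengers[passengers.duplicated(keep=False)]
print(all_copies)

# subset: compare only the listed columns when deciding what is a duplicate
subset_dupes = passengers.duplicated(subset=["name", "fare"])
print("Duplicates by name and fare:", subset_dupes.sum())

# Drop duplicates using the same subset, keeping the first occurrence
deduped = passengers.drop_duplicates(subset=["name", "fare"], keep="first")
print(deduped)
```

Which columns belong in `subset` depends on what identifies a record in your data. Here the two "Bob" rows differ only in the ticket, so they are duplicates under the subset rule but not under the default full-row rule.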

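Definition

An outlier is a data point that deviates substantially from the majority of the dataset. Common techniques for detecting outliers include visualization (such as box plots), statistical measures (such as the Z-score), and the interquartile range (IQR) method.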

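The **Z-score** and the **interquartile range (IQR)** are two statistical measures commonly used to identify outliers in a dataset:

Z-score:
- Measures how many standard deviations a data point lies from the mean;
- Calculated as (value - mean) / standard deviation;
- Data points with a Z-score above 3 or below -3 are often treated as outliers because they lie far from the mean.

Interquartile range (IQR):
- Represents the range between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile);
- Calculated as Q3 - Q1;
- Outliers are typically defined as data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, which places them outside the typical range of the middle 50% of the data.

Both methods measure how far a value deviates from the expected range. The Z-score focuses on distance from the mean, while the IQR method identifies values that fall outside the central portion of the dataset.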
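
The Z-score rule described above can be sketched in a few lines of pandas. This is a minimal illustration on made-up numbers, not on the Titanic data. The `values` series is hypothetical, and the threshold of 3 is the common convention mentioned above rather than a fixed rule. Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical measurements: 19 typical values and one extreme value
values = pd.Series(
    [10, 12, 11, 13, 12, 11, 10, 14, 13, 12,
     11, 12, 13, 10, 11, 12, 14, 13, 12, 100],
    dtype="float64",
    name="value",
)

# Z-score: (value - mean) / standard deviation
mean = values.mean()
std = values.std()  # sample standard deviation (ddof=1)
z_scores = (values - mean) / std

# Flag points more than 3 standard deviations away from the mean
z_outliers = values[z_scores.abs() > 3]
print("Outliers detected using the Z-score method:")
print(z_outliers)
```

In very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never exceed 3, which is one reason the IQR method is often preferred for small or heavily skewed data.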


```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Drop rows with missing 'fare' values
df_fare = df.dropna(subset=["fare"])

# Calculate Q1 and Q3 for the 'fare' column
Q1 = df_fare["fare"].quantile(0.25)
Q3 = df_fare["fare"].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers in 'fare'
outliers = df_fare[(df_fare["fare"] < lower_bound) | (df_fare["fare"] > upper_bound)]
print("Outliers detected in 'fare' using IQR method:")
print(outliers[["fare"]])
```

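Note

When handling outliers, you can either remove them or transform them, for example by capping extreme values at upper and lower limits or by applying a log transformation. The best approach depends on the dataset and the goals of the analysis.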
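
The three options in the note can be compared side by side. The sketch below uses a small, made-up `fares` series and the same IQR bounds as the detection example. Using `np.log1p` (the log of 1 + x) instead of a plain log is a choice made here so that zero values do not produce negative infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical fare values, invented for this illustration
fares = pd.Series([7.25, 8.05, 10.5, 13.0, 26.0, 30.0, 512.33], name="fare")

# IQR-based bounds, computed the same way as in the detection example
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Option 1: remove rows that fall outside the bounds
removed = fares[fares.between(lower_bound, upper_bound)]

# Option 2: cap extreme values at the bounds (keeps every row)
capped = fares.clip(lower=lower_bound, upper=upper_bound)

# Option 3: log-transform to compress the long right tail (keeps every row)
logged = np.log1p(fares)

print("Rows after removal:", len(removed))
print("Maximum after capping:", capped.max())
print("Maximum after log1p:", logged.max())
```

Removal shrinks the dataset, while capping and log transformation keep every row but change the values, so the choice affects any later step that depends on the original scale.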

Handling Duplicates and Outliers

