Learn Anomaly Detection in EDA


In exploratory data analysis (EDA), you often encounter anomalies and outliers: data points that differ significantly from most of your dataset.

Anomalies are values that stand out because they do not follow the general pattern. These can indicate:
- Errors in data collection;
- Rare events;
- Important variations that need further investigation.

Outliers are a specific type of anomaly. They are unusually high or low values in a numerical feature compared to the rest of the data.

Detecting anomalies and outliers is essential because they can:
- Skew summary statistics;
- Distort visualization patterns;
- Lead to misleading conclusions if not addressed.

Recognizing and interpreting anomalies helps you maintain data quality and make informed decisions about cleaning or exploring your data further.
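To make the effect on summary statistics concrete, here is a minimal sketch that compares the mean, median, and standard deviation of a small set of scores with and without two extreme values. The variable names (`typical_scores`, `with_extremes`, `summary`) are illustrative assumptions, not part of any library.

```python
import pandas as pd

# Illustrative scores: eleven typical values, then the same values plus two extremes
typical_scores = pd.Series([55, 60, 62, 58, 59, 61, 57, 60, 58, 59, 61])
with_extremes = pd.concat([typical_scores, pd.Series([97, 4])], ignore_index=True)

# Compare the same statistics side by side
summary = pd.DataFrame({
    "typical only": typical_scores.agg(["mean", "median", "std"]),
    "with extremes": with_extremes.agg(["mean", "median", "std"]),
})
print(summary.round(2))
```

In this sketch the median stays at 59 in both columns, while the mean shifts and the standard deviation grows severalfold. This is why the median is often preferred as a summary when outliers are suspected.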


import pandas as pd

# Sample data
data = {'score': [55, 60, 62, 58, 59, 97, 61, 57, 60, 58, 59, 61, 4]}
df = pd.DataFrame(data)

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['score'].quantile(0.25)
Q3 = df['score'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['score'] < lower_bound) | (df['score'] > upper_bound)]
print("Outliers detected:")
print(outliers)


import matplotlib.pyplot as plt
import seaborn as sns

# Visualize with a boxplot (reuses df and outliers from the IQR example above)
plt.figure(figsize=(8, 2))
sns.boxplot(x=df['score'], color='skyblue')

# Highlight outliers with a single scatter call so the legend shows one entry
plt.scatter(outliers['score'], [0] * len(outliers), color='red', s=100, label='Anomaly', zorder=3)

plt.title('Boxplot of Scores with Outliers Highlighted')
plt.xlabel('Score')
plt.legend()
plt.show()

When you detect anomalies or outliers in your data, you have several strategies for handling them:
- Investigate and correct data entry errors;
- Remove outliers if they result from mistakes;
- Keep outliers if they represent valid but rare events;
- Transform values (such as applying log transformations) to reduce their impact.
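Two of the strategies above, removal and transformation, can be sketched as follows. The `amount` data is hypothetical and chosen to be right-skewed. The 1.5 × IQR bounds and the use of `np.log1p` are one reasonable choice among several, not a prescribed method.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: purchase amounts with two very large values
df = pd.DataFrame({"amount": [12, 15, 14, 18, 20, 16, 13, 17, 250, 400]})

# Strategy 1: remove rows outside the 1.5 * IQR bounds
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
in_bounds = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_removed = df[in_bounds]

# Strategy 2: keep every row but compress large values with log(1 + x)
df["log_amount"] = np.log1p(df["amount"])

print("Rows kept after removal:", len(df_removed), "of", len(df))
print("Skewness before log:", round(df["amount"].skew(), 2))
print("Skewness after log: ", round(df["log_amount"].skew(), 2))
```

Removal discards the two extreme rows entirely. The log transform keeps them but pulls them closer to the rest, which lowers the skewness without eliminating it. Which option fits depends on whether those rows are errors or valid rare events.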

The approach you choose affects your analysis. Removing outliers can make patterns clearer and summary statistics more representative; ignoring meaningful anomalies may hide important insights. Always consider the context and potential consequences before deciding how to handle anomalies in your EDA process.


Section 1. Chapter 24
