Removing Outliers

Outliers are data points that differ significantly from the other data points in a dataset. They can occur due to measurement errors, data entry errors, or other factors.

Outliers can significantly impact statistical analysis, machine learning models, and data visualization. They can distort statistical results, bias machine learning models, and make the data harder to visualize accurately. Removing outliers can improve the accuracy and reliability of the analysis and make the results easier to interpret.

There are several ways to remove outliers in Python, but one common technique is the Z-score method:


import numpy as np

# Generate a sample dataset: 1,000 draws from a standard normal distribution
dataset = np.random.normal(0, 1, 1000)

# Calculate the Z-scores 
z_scores = (dataset - np.mean(dataset)) / np.std(dataset)

# Find the indices of the outliers
outlier_indices = np.where(np.abs(z_scores) > 3)[0]

# Print outliers
print('Outliers are: ', dataset[outlier_indices])

# Remove the outliers
filtered_data = np.delete(dataset, outlier_indices)

In this example, we first generate some sample data using the np.random.normal() function. We then calculate the Z-scores for the data by subtracting the mean and dividing by the standard deviation. We define outliers as any data point whose absolute Z-score is greater than 3 (a common threshold for identifying outliers). We find the indices of these outliers using the np.where() function and then remove them using the np.delete() function, which returns a new array and leaves the original data unchanged.

Note that this method works best for data that is approximately Gaussian (normally distributed), because the mean and standard deviation are themselves pulled toward extreme values. If your data has a skewed (non-symmetric) distribution, you can use a modified Z-score instead. The modified Z-score is calculated as the difference between a data point and the median, divided by the median absolute deviation (MAD).
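The modified Z-score described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name modified_z_scores, the small hand-made dataset, the scaling constant 0.6745, and the cutoff of 3.5 are all assumptions added here. The constant and cutoff are commonly used conventions for this method, but the text above does not specify them.

```python
import numpy as np

def modified_z_scores(data, scale=0.6745):
    """Return modified Z-scores based on the median and the MAD.

    `scale` is an assumed conventional constant that makes the score
    roughly comparable to a regular Z-score for normal data.
    """
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    # Median absolute deviation: the median of |x - median|
    mad = np.median(np.abs(data - median))
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return scale * (data - median) / mad

# Small illustrative dataset with one obvious extreme value
dataset = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

scores = modified_z_scores(dataset)

# Flag points whose absolute modified Z-score exceeds the assumed cutoff
outlier_mask = np.abs(scores) > 3.5

print('Outliers are: ', dataset[outlier_mask])

# Keep everything that was not flagged
filtered_data = dataset[~outlier_mask]
```

Because the median and the MAD are barely affected by a single extreme value, the extreme point does not mask itself the way it can when the mean and standard deviation are used.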

It is also important to remember that not all outliers need to be removed. Outliers can sometimes be a natural part of the data and provide important information about the underlying process or phenomenon being studied.

In some cases, outliers may represent rare or extreme events that are important to capture in the analysis. For example, in medical research, outliers in patient data may represent rare but important cases that need to be studied separately.

Furthermore, an extreme value can result either from a measurement error or from ordinary random fluctuation in the data, and a Z-score alone cannot tell the two apart. In such cases, removing all outliers may not be necessary or appropriate.
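When outliers should be reviewed or studied separately rather than discarded, one option is to flag them with a boolean mask instead of deleting them. The sketch below reuses the Z-score rule from the earlier example. The seeded generator and the variable names is_outlier, flagged, and regular are assumptions made for illustration.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is arbitrary)
rng = np.random.default_rng(42)
dataset = rng.normal(0, 1, 1000)

# Same Z-score rule as before
z_scores = (dataset - np.mean(dataset)) / np.std(dataset)
is_outlier = np.abs(z_scores) > 3

# Split the data instead of deleting anything
flagged = dataset[is_outlier]    # set aside for separate inspection
regular = dataset[~is_outlier]   # used for the main analysis

print('Flagged for review:', flagged)
```

Nothing is lost this way: every point ends up in exactly one of the two arrays, so the decision to drop or keep the flagged values can be made later.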
