Anomaly Detection Techniques with Python

Exploring the Unseen: Mastering Anomaly Detection in Data

by Kyryl Sidak

Data Scientist, ML Engineer

Dec, 2023・
8 min read

In data science, anomaly detection is the process of finding observations that deviate from expected norms. Such observations often hold the key to understanding complex phenomena across various domains, from cybersecurity and finance to healthcare and environmental studies. This guide covers anomaly detection techniques in Python, a powerful and versatile language favored by data scientists and analysts.

Understanding Anomalies

Anomalies, commonly known as outliers, are data points that differ significantly from the majority of a dataset. These anomalies could be the result of a variety of factors such as experimental error, fraudulent activity, or a novel discovery that challenges existing norms.

There are three main types of anomalies:

Point Anomalies: Individual data points that are abnormal compared to the rest.
Contextual Anomalies: Anomalies that are context-specific and may not be outliers in a different setting.
Collective Anomalies: Groups of data points that collectively deviate from the norm, even if the individual points might not be anomalous.
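
To make the difference between a point anomaly and a contextual anomaly concrete, here is a minimal sketch on synthetic temperature readings. The data, the column names `season` and `temp_c`, and the 3-standard-deviation cutoff are illustrative assumptions, not taken from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic readings: 50 winter days around 0 °C, 50 summer days around 25 °C,
# plus one hypothetical 24 °C reading recorded in winter (last row).
readings = pd.DataFrame({
    "season": ["winter"] * 50 + ["summer"] * 50 + ["winter"],
    "temp_c": np.concatenate([rng.normal(0, 2, 50), rng.normal(25, 2, 50), [24.0]]),
})

# Global z-score: compares each reading with the whole dataset.
global_z = (readings["temp_c"] - readings["temp_c"].mean()) / readings["temp_c"].std()

# Contextual z-score: compares each reading with its own season only.
by_season = readings.groupby("season")["temp_c"]
context_z = (readings["temp_c"] - by_season.transform("mean")) / by_season.transform("std")

warm_winter_day = readings.index[-1]
print(f"global |z|:     {abs(global_z[warm_winter_day]):.2f}")
print(f"contextual |z|: {abs(context_z[warm_winter_day]):.2f}")
```

The 24 °C reading is unremarkable against the full dataset but lies far outside the other winter readings, which is what makes it a contextual anomaly rather than a point anomaly.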

Statistical Methods

Statistical methods are foundational in anomaly detection, relying on the assumption that normal data follows a well-defined distribution. The two most common statistical methods are:

Z-Score Analysis;
IQR (Interquartile Range) Method.

Z-Score Analysis

This method identifies outliers by measuring how many standard deviations a data point lies from the mean. It's effective for data that follows a Gaussian distribution: data points more than 3 standard deviations away from the mean are considered outliers. This threshold stems from the fact that in a Gaussian distribution approximately 99.7% of the data points lie within 3 standard deviations of the mean.

However, this method has one disadvantage: because it relies on the mean and standard deviation, it is sensitive to the very outliers it is meant to detect.

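# Example of Z-Score Analysis
import numpy as np
from scipy import stats

# data is assumed to be a numeric pandas DataFrame with no missing values
# Absolute z-score of every value, computed column by column
z_scores = np.abs(stats.zscore(data))
# Flag a row if any column is more than 3 standard deviations from its mean;
# use .all(axis=1) instead to require every column to be extreme at once
anomalies = data[(z_scores > 3).any(axis=1)]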
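
The sensitivity described above can be seen on a small, made-up sample. This sketch uses only NumPy and a hypothetical array `values`; it illustrates the masking effect and is not a recommended detection procedure.

```python
import numpy as np

# Hypothetical sample: seven ordinary readings and one extreme value
values = np.array([10, 11, 9, 10, 12, 10, 11, 100], dtype=float)

# The extreme value pulls the mean up and inflates the standard deviation
mean, std = values.mean(), values.std()
z_scores = np.abs((values - mean) / std)

flagged = values[z_scores > 3]
print(f"mean={mean:.2f}, std={std:.2f}, largest |z|={z_scores.max():.2f}")
print("flagged:", flagged)
```

Here the value 100 raises the mean to about 21.6 and the standard deviation to about 29.6, so its own z-score stays below 3 and nothing is flagged. With only 8 points, no z-score computed this way can exceed sqrt(8 - 1) ≈ 2.65, so the 3-standard-deviation rule cannot fire at all on a sample this small.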

IQR (Interquartile Range) Method

The Interquartile Range (IQR) method for anomaly detection is a robust and intuitive statistical technique, which is particularly effective in handling skewed distributions or when the data contains extreme values that could distort the mean and standard deviation.

Here's a more detailed explanation of the IQR method:

Understanding the IQR: The IQR represents the range within which the central 50% of the data lies. It is calculated by subtracting the 25th percentile (Q1) from the 75th percentile (Q3) of the data. This range effectively captures the "middle spread" of the data, avoiding the influence of extreme values at either end.
Identifying Outliers with the IQR: Anomalies are identified by determining how far a data point lies from this middle 50%. Typically, data points are considered outliers if they fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The 1.5 multiplier is a convention: for normally distributed data, about 99.3% of points lie within these bounds, so only about 0.7% would be flagged.
Advantages Over Z-Score Method:
- Robustness: Unlike the Z-score, which relies on the mean and standard deviation, the IQR is less susceptible to being influenced by outliers. This makes the IQR method more reliable, particularly in datasets where outliers can significantly skew the mean and standard deviation.
- Applicability to Non-Normal Distributions: The IQR method does not assume that the data is normally distributed, making it more versatile for different types of data distributions.
Practical Implementation in Python:

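# Example of IQR Method
# data is assumed to be a numeric pandas DataFrame
# Calculate Q1 and Q3 for each column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
# Calculate the IQR
IQR = Q3 - Q1
# Build a boolean mask of the values that fall outside the bounds
outlier_mask = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))
# Keep the rows that contain an outlier in at least one column
outliers = data[outlier_mask.any(axis=1)]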
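
As a quick worked example, the sketch below applies the IQR rule to a small hypothetical pandas Series and then checks the 99.3% figure for a standard normal distribution using Python's standard-library `statistics.NormalDist`. The sample values and variable names are illustrative assumptions.

```python
from statistics import NormalDist

import pandas as pd

# Hypothetical sample: seven ordinary readings and one extreme value
values = pd.Series([10, 11, 9, 10, 12, 10, 11, 100], dtype=float)

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, bounds=({lower}, {upper})")
print("outliers:", outliers.tolist())

# Share of a standard normal distribution that falls inside the same bounds
normal = NormalDist()
n_q1, n_q3 = normal.inv_cdf(0.25), normal.inv_cdf(0.75)
n_iqr = n_q3 - n_q1
coverage = normal.cdf(n_q3 + 1.5 * n_iqr) - normal.cdf(n_q1 - 1.5 * n_iqr)
print(f"normal coverage: {coverage:.4f}")
```

On this sample Q1 is 10 and Q3 is 11.25, so the bounds are 8.125 and 13.125 and only the value 100 is flagged. The quartiles barely move when the extreme value is present, which is why the rule still catches it.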

Machine Learning-Based Approaches

Machine learning methods learn what normal data looks like from the data itself, which makes them adaptable approaches to anomaly detection in complex, multi-dimensional datasets.

Isolation Forest: This algorithm isolates anomalies instead of profiling normal data points. It works well with high-dimensional data and does not require a training set free of outliers. It builds an ensemble of random decision trees and scores each sample by the number of splits required to isolate it: anomalies are few and different, so they tend to be isolated in fewer splits.
```
# Example of Isolation Forest
from sklearn.ensemble import IsolationForest

# data is assumed to be a numeric pandas DataFrame
iso_forest = IsolationForest(n_estimators=100)
iso_forest.fit(data)
# predict returns -1 for anomalies and 1 for normal points
predictions = iso_forest.predict(data)
anomalies = data[predictions == -1]
```
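To show how the predictions are typically read, here is a self-contained sketch on synthetic data. The data, the `contamination=0.05` setting, and `random_state=0` are assumptions chosen for illustration. `predict` and `fit_predict` return -1 for anomalies and 1 for normal points, and lower `score_samples` values indicate more anomalous points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 200 points around the origin plus 5 hand-placed distant points
normal_points = rng.normal(0, 1, size=(200, 2))
far_points = np.array([[8, 8], [-8, 8], [8, -8], [-8, -8], [0, 10]], dtype=float)
X = np.vstack([normal_points, far_points])

# contamination is the assumed share of anomalies; it sets the decision threshold
iso_forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
predictions = iso_forest.fit_predict(X)   # -1 = anomaly, 1 = normal
scores = iso_forest.score_samples(X)      # lower = more anomalous

print("flagged rows:", np.where(predictions == -1)[0])
print("mean score, cluster vs distant:", scores[:200].mean(), scores[200:].mean())
```

Because `contamination` fixes the share of rows that get flagged, roughly 5% of the rows are reported here: the five distant points plus a few points at the edge of the cluster. In practice this value is usually a judgment call informed by domain knowledge.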
One-Class SVM (Support Vector Machine): Designed for novelty detection, this method is useful when the training data consists mostly or entirely of normal instances. It learns a boundary around the normal data and identifies points outside this boundary as anomalies.
```
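# Example of One-Class SVM
from sklearn.svm import OneClassSVM

# data is assumed to be a numeric pandas DataFrame of mostly normal observations
oc_svm = OneClassSVM(kernel='rbf', gamma=0.1)
oc_svm.fit(data)
# predict returns -1 for anomalies and 1 for normal points
predictions = oc_svm.predict(data)
anomalies = data[predictions == -1]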
```
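The novelty-detection workflow, fitting on normal data and then scoring unseen observations, might look like the following sketch. The synthetic training data, the two test points, and the `nu=0.05` setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Train only on data assumed to be normal
X_train = rng.normal(0, 1, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as outliers
oc_svm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)
oc_svm.fit(X_train)

# Score previously unseen observations: 1 = normal, -1 = anomaly
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
predictions = oc_svm.predict(X_new)
print(predictions)
```

The point at the center of the training data falls inside the learned boundary, while the distant point falls outside it. Both `gamma` and `nu` affect how tight the boundary is, so they usually need tuning for real data.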

Proximity-Based Methods

These methods are based on the proximity or distance between data points. Here are the most commonly used ones:

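k-Nearest Neighbors (k-NN): This family of algorithms detects anomalies by considering the distance of a point from its k nearest neighbors. Points with a significantly higher average distance are marked as anomalies. The example below uses scikit-learn's Local Outlier Factor (LOF), a k-NN-based variant that compares the local density around each point with the density around its neighbors.

# Example of k-NN-based detection with Local Outlier Factor
from sklearn.neighbors import LocalOutlierFactor

# data is assumed to be a numeric pandas DataFrame
lof = LocalOutlierFactor(n_neighbors=20)
# fit_predict returns -1 for anomalies and 1 for normal points
predictions = lof.fit_predict(data)
anomalies = data[predictions == -1]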
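
The snippet above relies on `LocalOutlierFactor`. The average-distance rule described in the text can also be sketched directly with scikit-learn's `NearestNeighbors`. The synthetic data, `k = 5`, and the mean-plus-3-standard-deviations cutoff are assumptions for illustration; other thresholds, such as a fixed percentile, are equally plausible.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic data: 200 clustered points plus 3 hand-placed distant points
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0]]),
])

k = 5
# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
avg_distance = distances[:, 1:].mean(axis=1)

# Illustrative cutoff: mean + 3 standard deviations of the average distances
threshold = avg_distance.mean() + 3 * avg_distance.std()
anomaly_idx = np.where(avg_distance > threshold)[0]
print("anomalous rows:", anomaly_idx)
```

The three distant points (rows 200 to 202) have a much larger average distance to their neighbors than the clustered points, so they exceed the cutoff.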

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN clusters dense regions of data points and identifies points that do not belong to these clusters as outliers. It's effective for data with clusters of varying shapes and sizes.
```
# Example of DBSCAN
from sklearn.cluster import DBSCAN

# data is assumed to be a numeric pandas DataFrame
dbscan = DBSCAN(eps=0.5, min_samples=10)
dbscan.fit(data)
# labels_ holds one cluster label per row; -1 marks noise points
labels = dbscan.labels_
anomalies = data[labels == -1]
```
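
To see how the noise label behaves, here is a self-contained sketch on synthetic data with two tight clusters and three isolated points. The data and the `eps=0.5`, `min_samples=10` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic data: two tight clusters plus 3 hand-placed isolated points
X = np.vstack([
    rng.normal(0, 0.3, size=(100, 2)),
    rng.normal(4, 0.3, size=(100, 2)),
    np.array([[2.0, 2.0], [10.0, -10.0], [-5.0, 5.0]]),
])

dbscan = DBSCAN(eps=0.5, min_samples=10).fit(X)
labels = dbscan.labels_

# Cluster labels start at 0; the label -1 is reserved for noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_idx = np.where(labels == -1)[0]
print("clusters found:", n_clusters)
print("noise rows:", noise_idx)
```

The three isolated points are labeled as noise, and sparse points at the edges of the clusters may be labeled as noise as well. Since `eps` is a distance, its value depends on the scale of the features, so features on different scales usually need standardizing before DBSCAN is applied.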

FAQs

Q: What are the prerequisites for learning anomaly detection in Python?
A: Basic knowledge of Python, statistics, and machine learning concepts is beneficial.

Q: Can anomaly detection be fully automated?
A: Yes, with proper setup and parameter tuning, anomaly detection can be automated for ongoing monitoring.

Q: How does anomaly detection aid in business intelligence?
A: It helps identify fraud, system failures, or market trends early, aiding in prompt decision-making.

Q: Is domain knowledge crucial in anomaly detection?
A: Yes, understanding the domain context is key to interpreting anomaly detection results accurately.

Q: Can these techniques be applied to any type of data?
A: While versatile, some techniques are better suited for specific types of data, such as time-series or spatial data.
