Removing Outliers Using Z-Score Method
One common method for detecting and removing outliers is the z-score method. This technique measures how far a data point is from the mean in terms of standard deviations. If a data point lies beyond a certain threshold (commonly ±3), it is considered an outlier.
What Is a Z-Score?
A z-score (also known as a standard score) is calculated using the formula:
$$Z = \frac{X - \mu}{\sigma}$$
Where:
- X: the original data point;
- μ: the mean of the dataset;
- σ: the standard deviation of the dataset.
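To make the formula concrete, here is a minimal sketch that applies it to a small vector. The values in `x` and the names `mu`, `sigma`, and `z` are made up for illustration and do not come from the lesson's dataset. Note that R's `sd()` computes the sample standard deviation (dividing by n - 1).

```r
# Hypothetical sample values, chosen only to illustrate the formula
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mu <- mean(x)      # mean of the dataset
sigma <- sd(x)     # sample standard deviation (divides by n - 1)

# Apply Z = (X - mu) / sigma to every element at once
z <- (x - mu) / sigma
print(round(z, 2))

# Standardized values always have mean 0 and standard deviation 1
print(c(mean = mean(z), sd = sd(z)))
```

A value equal to the mean gets a z-score of 0, and the sign shows whether a point lies above or below the mean.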
Calculating Z-Scores
You can either compute z-scores manually by following the formula:
mean_cgpa <- mean(df$cgpa)
sd_cgpa <- sd(df$cgpa)
df$cgpa_zscore <- (df$cgpa - mean_cgpa) / sd_cgpa
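Or you can use the built-in scale() function, which by default centers the values on the mean and divides by the standard deviation. It returns a one-column matrix, so wrap it in as.numeric() to store a plain numeric column:
df$cgpa_zscore <- as.numeric(scale(df$cgpa))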
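As a quick sanity check, the sketch below compares the manual calculation with `scale()` on a small stand-in data frame. The `cgpa` values here are hypothetical, and `manual` and `scaled` are names introduced only for this example.

```r
# Hypothetical data frame standing in for the lesson's df
df <- data.frame(cgpa = c(6.5, 7.2, 8.1, 7.8, 9.0, 5.9))

# Manual z-scores, following the formula
manual <- (df$cgpa - mean(df$cgpa)) / sd(df$cgpa)

# Built-in standardization (default: center = TRUE, scale = TRUE)
scaled <- scale(df$cgpa)

print(is.matrix(scaled))                       # scale() returns a matrix
print(all.equal(manual, as.numeric(scaled)))   # same values once flattened

# Store a plain numeric vector rather than a matrix column
df$cgpa_zscore <- as.numeric(scaled)
```

Both approaches give the same numbers. The manual version keeps each step visible, while `scale()` is shorter and can also standardize several columns at once.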
Identifying Outliers
After calculating the z-scores, you can choose a threshold (±3 in this case) and apply a simple filtering operation to select all entries outside of the range:
thresh_hold <- 3
outliers <- df[df$cgpa_zscore > thresh_hold | df$cgpa_zscore < -thresh_hold, ]
Or you can select all entries inside the range (including values exactly at the threshold, so that every row falls into one of the two groups) to create an outlier-free dataset:
df2 <- df[df$cgpa_zscore <= thresh_hold & df$cgpa_zscore >= -thresh_hold, ]
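The whole workflow can be sketched end to end on toy data. The data frame below is hypothetical (18 typical CGPA values plus one extreme entry), and `is_outlier` is a helper name introduced for this example. It uses `abs()` as a compact equivalent of the two-sided comparison.

```r
# Hypothetical data: 18 typical CGPA values plus one extreme entry
df <- data.frame(cgpa = c(rep(c(7.5, 8.0, 8.5), times = 6), 2.0))

# Step 1: compute z-scores
df$cgpa_zscore <- (df$cgpa - mean(df$cgpa)) / sd(df$cgpa)

# Step 2: flag rows whose z-score lies beyond the threshold
thresh_hold <- 3
is_outlier <- abs(df$cgpa_zscore) > thresh_hold

# Step 3: split the data into outliers and the cleaned dataset
outliers <- df[is_outlier, ]
df2 <- df[!is_outlier, ]

print(outliers)
print(nrow(df2))
```

Here only the row with `cgpa` equal to 2.0 is flagged, and the remaining 18 rows end up in `df2`. One caveat when trying this on your own data: with z-scores based on `sd()`, a sample needs at least 11 rows before any single value can exceed ±3, so very small datasets never produce outliers at this threshold.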