Outliers | Preprocessing Data: Part II

# Data Manipulation using pandas

## Outliers

Another issue that can occur while working with data is outliers. Outliers are values that are extremely high or low relative to the rest of the data. For instance, the previous example with ages fits here too: the value `1986` is significantly greater than the remaining ones (`30, 16, 46, 23`). Or you may have people's heights in centimeters: `187, 165, 196, 178, 1.82, 180, 35435`. The values `1.82` and `35435` can be considered outliers. Knowing the specifics of the dataset, you may guess that `1.82` most likely means `182` cm (a height entered in meters), but what about `35435`? There is no obvious way to recover the intended value.
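One common way to flag such values programmatically is the interquartile range (IQR) rule. The sketch below applies it to the heights listed above. The 1.5 × IQR threshold is a widely used convention and an assumption here, not something this lesson prescribes.

```python
import pandas as pd

# Heights from the example above, in centimeters
heights = pd.Series([187, 165, 196, 178, 1.82, 180, 35435])

# First and third quartiles and the interquartile range
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = heights[(heights < lower) | (heights > upper)]
print(outliers.tolist())
```

With these seven values, the rule flags exactly the two suspicious entries, `1.82` and `35435`.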

There are several ways of dealing with outliers. The first and easiest is to remove them, but, as usual, be careful not to remove a large share of the data. Another solution is to replace outliers with some other value: the mean, the median, the 1st or 3rd quartile, or some constant. It's up to you to choose the appropriate method. For instance, let's analyze some of the numerical columns (`'empinch', 'invsth', 'govinch', 'otinch', 'totinch'`). These columns represent different sources of income, with the last column representing total household income.
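A quick way to analyze numerical columns like these is `DataFrame.describe()`, which reports the count, mean, standard deviation, minimum, quartiles, and maximum for each column. The sketch below runs it on a small made-up DataFrame: the values are hypothetical and are not the course dataset, and computing `'totinch'` as the sum of the other four columns is an assumption made only for illustration.

```python
import pandas as pd

# Made-up values for illustration only; NOT the course dataset
df = pd.DataFrame({
    'empinch': [25000.0, 31000.0, -500.0, 42000.0, 0.0],
    'invsth': [1200.0, -300.0, 0.0, 800.0, 150.0],
    'govinch': [0.0, 2500.0, 4000.0, 0.0, 1000.0],
    'otinch': [0.0, 500.0, 0.0, 0.0, 250.0],
})

# Assumption: total income is the sum of the individual income sources
df['totinch'] = df[['empinch', 'invsth', 'govinch', 'otinch']].sum(axis=1)

income_cols = ['empinch', 'invsth', 'govinch', 'otinch', 'totinch']

# Summary statistics: count, mean, std, min, 25%, 50%, 75%, max
summary = df[income_cols].describe()
print(summary)
```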

What do the 25%, 50%, and 75% values mean? The first and the last are the first and third quartiles, respectively: the values below which 25% and 75% of the observations fall. The 50% value, also called the median, is the middle value of the data arranged in increasing order. In the summary statistics for these columns, the minimum values of the `'empinch'` and `'invsth'` columns are negative. That is a problem, since both columns represent income, which should not be negative.
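The quartile definitions above, and the two ways of handling invalid values mentioned earlier (removing or replacing), can be sketched on a single column. The values below are made up for illustration and are not taken from the course dataset; using the median of the non-negative values as the replacement is just one of the options listed, not a recommendation from the lesson.

```python
import pandas as pd

# Made-up employment income values; NOT the course dataset
empinch = pd.Series([25000.0, 31000.0, -500.0, 42000.0,
                     0.0, 18000.0, 27000.0, 52000.0])

# First quartile, median, and third quartile
q1, median, q3 = empinch.quantile([0.25, 0.5, 0.75])

# Share of values below each quartile (25% and 75% for this sample)
share_below_q1 = (empinch < q1).mean()
share_below_q3 = (empinch < q3).mean()
print(q1, median, q3, share_below_q1, share_below_q3)

# Negative income is treated as invalid here
is_negative = empinch < 0

# Option 1: remove the invalid values
dropped = empinch[~is_negative]

# Option 2: replace them with the median of the valid values
valid_median = empinch[~is_negative].median()
replaced = empinch.mask(is_negative, valid_median)
print(replaced.tolist())
```

Removing keeps only trustworthy rows but shrinks the data, while replacing preserves the row count at the cost of inserting a value that was never observed.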

Section 2. Chapter 4
