Outliers | Preprocessing Data: Part II
Data Manipulation using pandas

Outliers

Another issue that can occur when working with data is outliers. Outliers are values that are extremely high or low relative to the others. For instance, the previous example with ages fits here too: the value 1986 is significantly greater than the remaining ones (30, 16, 46, 23). Or you may have people's heights in centimeters: 187, 165, 196, 178, 1.82, 180, 35435. The values 1.82 and 35435 can be considered outliers. Knowing the specifics of the dataset, you may guess that 1.82 is most likely 182 cm entered in meters, but what about 35435? There is no obvious way to recover the intended value.
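The heights example can be checked directly in pandas. The snippet below is a minimal sketch, assuming that a plausible height lies between 100 and 250 cm; these bounds (`LOW`, `HIGH`) are an illustrative assumption, not a rule from the course, and should be adapted to the data at hand.

```python
import pandas as pd

# Heights from the example above, in centimeters
heights = pd.Series([187, 165, 196, 178, 1.82, 180, 35435])

# Assumed plausible range for a height in cm (illustrative bounds only)
LOW, HIGH = 100, 250

# Flag every value that falls outside the assumed range
is_outlier = (heights < LOW) | (heights > HIGH)
print(heights[is_outlier])

# The two outliers pull the mean far away from typical values,
# while the median stays close to them
print(heights.mean(), heights.median())
```

Comparing the mean with the median is a quick way to see how strongly a few extreme values can distort a summary of the column.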

There are several ways of dealing with outliers. The first and easiest is to remove them, but, as usual, be careful not to remove a large share of the data. Another solution is to replace outliers with some other value: the mean, the median, the 1st or 3rd quartile, or some constant. It's up to you to choose the appropriate method.

For instance, let's analyze some of the numerical columns ('empinch', 'invsth', 'govinch', 'otinch', 'totinch'). These columns represent different sources of income, with the last column representing total household income.

# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data3.csv')

# Summary of certain columns
print(df[['empinch', 'invsth', 'govinch', 'otinch', 'totinch']].describe())

What do the 25%, 50%, and 75% values mean? The first and the last are the first and third quartiles, respectively: 25% of the values are less than the first quartile, and 75% are less than the third. The 50% value, also called the median, is the middle value of the ordered series (the values arranged in increasing order). As you can see, the minimum values of the 'empinch' and 'invsth' columns are negative. This is a problem, since both columns represent income, which should not be negative.
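To make the quartiles and the negative-income problem concrete, here is a small sketch on made-up data. Only the column name 'empinch' comes from the dataset; the values in `toy` are hypothetical, and the choice of the median as the replacement value is just one of the options mentioned earlier, not a recommendation from the course.

```python
import pandas as pd

# Hypothetical toy data; only the column name matches the real dataset
toy = pd.DataFrame(
    {'empinch': [-500.0, 0.0, 1200.0, 2500.0, 3100.0, 4000.0, 5200.0, 8000.0]}
)

# The same quartiles that describe() reports as 25%, 50% and 75%
quartiles = toy['empinch'].quantile([0.25, 0.5, 0.75])
print(quartiles)

# Boolean mask of rows where the income is negative
negative = toy['empinch'] < 0
print(negative.sum())

# Option 1: remove the suspicious rows
removed = toy[~negative]

# Option 2: replace the suspicious values with the median of the valid ones
median_valid = toy.loc[~negative, 'empinch'].median()
replaced = toy.copy()
replaced.loc[negative, 'empinch'] = median_valid
print(replaced)
```

Removing rows keeps only trusted values but shrinks the data, while replacing keeps every row at the cost of introducing an artificial value, so checking how many rows are affected (`negative.sum()`) is a sensible first step.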

Section 2. Chapter 4