Outliers

Another issue you may encounter when working with data is outliers. Outliers are values that are extremely high or low relative to the others. For instance, the previous example with ages fits here too: the value 1986 is significantly greater than the remaining ones (30, 16, 46, 23). Or you may have people's heights in centimeters: 187, 165, 196, 178, 1.82, 180, 35435. The values 1.82 and 35435 can be considered outliers. Knowing the specifics of the dataset, you may guess that 1.82 is most likely 182 cm, but what about 35435? There seems to be no way to recover a sensible value from it.
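The heights example can be sketched in pandas. This is only an illustration: the plausible range of 50 to 250 cm and the rule that values below 3 were entered in meters are assumptions made for this sketch, not rules from the dataset.

```python
import pandas as pd

# Heights in centimeters, taken from the example above
heights = pd.Series([187, 165, 196, 178, 1.82, 180, 35435])

# Assumed plausible range for a person's height in cm (illustrative choice)
LOW, HIGH = 50, 250

# Assumption: values below 3 were entered in meters, so convert them to cm
fixed = heights.where(heights >= 3, heights * 100)

# Flag everything that is still outside the plausible range
is_outlier = ~fixed.between(LOW, HIGH)

print(fixed[is_outlier])   # values that cannot be repaired
print(fixed[~is_outlier])  # heights that look reasonable
```

With these assumptions, 1.82 is repaired to 182, while 35435 stays flagged because no simple rule explains it.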

There are several ways to deal with outliers. The first and easiest is to remove them. But, as usual, you should be careful not to remove a large share of the data. Another solution is to replace outliers with some other value: the mean, the median, the 1st or 3rd quartile, or some constant. It's up to you to choose the appropriate method. For instance, let's analyze some of the numerical columns ('empinch', 'invsth', 'govinch', 'otinch', 'totinch'). These columns represent different sources of income, with the last column representing total household income.


# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data3.csv')
# Summary of certain columns
print(df[['empinch', 'invsth', 'govinch', 'otinch', 'totinch']].describe())

What do the 25%, 50%, and 75% values mean? The first and the last are the first and third quartiles, respectively: 25% of the values lie below the first quartile, and 75% lie below the third. The 50% value, also called the median, is the middle value of the ordered series (the values arranged in increasing order). As you can see, the minimum values of the 'empinch' and 'invsth' columns are negative. That is a problem, since both columns represent income, which surely can't be negative.
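To make the quartiles and the two handling strategies concrete, here is a minimal sketch on a small made-up series. The numbers are hypothetical and only stand in for an income column such as 'empinch'; they are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical income values (not from the real dataset)
income = pd.Series(
    [1200, 800, -50, 1500, 950, -300, 2000, 1100], dtype="float64"
)

# Quartiles: the values describe() reports as 25%, 50%, and 75%
q1, median, q3 = income.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

# Flag incomes that are negative
is_negative = income < 0

# Option 1: remove the offending rows
removed = income[~is_negative]

# Option 2: replace them with the median of the valid values
valid_median = income[~is_negative].median()
replaced = income.mask(is_negative, valid_median)

print(removed)
print(replaced)
```

Removing rows shrinks the data, while replacing keeps every row but changes the distribution, so which option fits depends on how many values are affected.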


Section 2. Chapter 4
