Data Manipulation using pandas
Outliers
Another issue that can occur when working with data is outliers. Outliers are values that are extremely high or low relative to the others. For instance, the previous example with ages fits here too: the value 1986 is significantly greater than the remaining ones (30, 16, 46, 23). Or you may have people's heights in centimeters: 187, 165, 196, 178, 1.82, 180, 35435. The values 1.82 and 35435 can be considered outliers. Knowing the specifics of the dataset, you may guess that 1.82 is most likely 182 cm (presumably a height entered in meters), but what about 35435? There seems to be no sensible way to recover that value.
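To make the heights example concrete, here is a minimal sketch of one way to flag such values automatically. It assumes the common 1.5 * IQR rule of thumb, which this lesson does not prescribe; the variable names are illustrative only.

```python
import pandas as pd

# Heights in centimeters, taken from the example above
heights = pd.Series([187, 165, 196, 178, 1.82, 180, 35435])

# First and third quartiles and the interquartile range (IQR)
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Assumed rule of thumb (one option among many): values farther than
# 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]

print(outliers.tolist())  # [1.82, 35435.0]
```

A rule like this only flags suspicious values. Deciding whether 1.82 should become 182 still requires knowledge of the dataset.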
There are several ways of dealing with outliers. The first and easiest is to remove them, but, as usual, be careful not to remove a large share of the data. Another solution is to replace outliers with some other value: the mean, the median, the 1st or 3rd quartile, or some constant. It's up to you to choose the appropriate method. For instance, let's analyze some of the numerical columns ('empinch', 'invsth', 'govinch', 'otinch', 'totinch'). These columns represent different sources of income, with the last column representing total household income.
# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data3.csv')

# Summary of certain columns
print(df[['empinch', 'invsth', 'govinch', 'otinch', 'totinch']].describe())
What do the 25%, 50%, and 75% values mean? The first and the last are the first and third quartiles, respectively: 25% of the values are less than the first quartile, and 75% are less than the third. The 50% value is also called the median; it is the middle value of the variation series (the values arranged in increasing order). As you can see, the minimum values of the 'empinch' and 'invsth' columns are negative. That is a warning sign, since both columns represent income, which surely can't be negative.
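The sketch below ties these ideas together on a small hypothetical sample. Only the column names come from the lesson; the numbers are invented for illustration and are not the real dataset. It shows how the quartiles reported by describe() can be computed directly, how to count negative values, and the two handling strategies mentioned earlier (removal and replacement with the median).

```python
import pandas as pd

# Hypothetical sample, NOT the real dataset: only the column names
# are taken from the lesson
sample = pd.DataFrame({
    'empinch': [-500, 12000, 30000, 45000, 52000],
    'invsth': [0, -200, 150, 900, 4000],
})

# The 25%, 50% and 75% rows of describe() correspond to these quantiles
quartiles = sample['empinch'].quantile([0.25, 0.5, 0.75])
print(quartiles)

# Count the negative values in each column
negative_counts = (sample < 0).sum()
print(negative_counts)

# Option 1: remove every row that contains a negative income
cleaned = sample[(sample >= 0).all(axis=1)]

# Option 2: replace each negative value with the median of the
# non-negative values in the same column
valid = sample.where(sample >= 0)        # negatives become NaN
replaced = valid.fillna(valid.median())  # NaN filled per column

print(cleaned)
print(replaced)
```

Removal is simpler but shrinks the data (two of the five rows here), while replacement keeps every row at the cost of introducing artificial values. Which trade-off is acceptable depends on the dataset.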