Data Manipulation using pandas
Another issue that can occur while working with data is outliers. Outliers are values that are extremely high or low relative to the others. For instance, the previous example with ages fits here too: the value 1986 is significantly greater than the remaining ones (30, 16, 46, 23). Or you may have people's heights in centimeters: 187, 165, 196, 178, 1.82, 180, 35435. The values 1.82 and 35435 can be considered outliers. Knowing the specifics of the dataset, you may guess that 1.82 is most likely 182 cm, but what about 35435? There seems to be no way to recover that value.
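As a rough illustration of the heights example, the sketch below flags values that fall outside a plausible range and then repairs the one that looks like a unit mistake. The range limits (50 to 250 cm) and the rule "anything below 3 is in meters" are assumptions made for this example, not rules from the dataset.

```python
import pandas as pd

# Heights from the example above, supposedly in centimeters.
heights = pd.Series([187, 165, 196, 178, 1.82, 180, 35435])

# Assumed plausible range for a person's height in cm (adjust for your data).
LOW, HIGH = 50, 250

# Values outside the plausible range are treated as outliers.
outliers = heights[~heights.between(LOW, HIGH)]
print(outliers)

# Assumption: a value below 3 was entered in meters, so convert it to cm.
fixed = heights.where(heights >= 3, heights * 100)

# After the unit fix, only the unrecoverable value is still out of range.
still_bad = fixed[~fixed.between(LOW, HIGH)]
print(still_bad)
```

A range check like this only works when you know what the column is supposed to contain; for columns without an obvious valid range, you would need some other criterion.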
There are several ways of dealing with outliers. The first and easiest is to remove them, but as usual you should be careful not to remove a large share of the data. Another solution is to replace outliers with some other value: the mean, the median, the 1st or 3rd quartile, or some constant. It's up to you to choose the appropriate method. For instance, let's analyze some of the numerical columns ('empinch', 'invsth', 'govinch', 'otinch', 'totinch'). These columns represent different sources of income, with the last column representing total household income.
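A quick way to get summary statistics for these columns is `describe()`. The sketch below runs it on a small made-up DataFrame: only the column names come from the text, while the values (and the choice to compute the total as a plain sum) are invented for illustration, so the numbers will not match the real dataset.

```python
import pandas as pd

# Made-up stand-in for the real dataset; only the column names are from the text.
df = pd.DataFrame({
    "empinch": [1200.0, 0.0, 2500.0, 1800.0, 950.0],
    "invsth": [0.0, 150.0, -40.0, 0.0, 20.0],
    "govinch": [0.0, 300.0, 0.0, 120.0, 400.0],
    "otinch": [50.0, 0.0, 0.0, 30.0, 10.0],
})
# Toy data only: here the total is simply the sum of the other columns.
df["totinch"] = df[["empinch", "invsth", "govinch", "otinch"]].sum(axis=1)

cols = ["empinch", "invsth", "govinch", "otinch", "totinch"]

# describe() reports count, mean, std, min, 25%, 50%, 75% and max per column.
summary = df[cols].describe()
print(summary)
```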
What do the 25%, 50%, and 75% values mean? The first and the last are the first and third quartiles, respectively: 25% of the values are less than the first quartile, and 75% are less than the third. The 50% value is also called the median; it is the middle value of the variation series (the values arranged in increasing order). As you can see, the minimum values of the 'invsth' columns are negative. This isn't good, since both columns represent income, and income surely can't be negative.
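To make the quartiles concrete, and to show what removing or replacing impossible values might look like, here is a minimal sketch on an invented income series. The values are made up, and treating every negative number as invalid is an assumption based on the reasoning above; which strategy suits the real data is still your call.

```python
import pandas as pd

# Made-up income values; -40 stands in for an impossible negative income.
income = pd.Series([0.0, 150.0, -40.0, 0.0, 20.0, 300.0, 80.0], name="invsth")

# First quartile, median and third quartile (the 25%, 50% and 75% rows).
q1, median, q3 = income.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

# Assumption: any negative income is invalid.
negative = income < 0

# Option 1: remove the invalid rows.
dropped = income[~negative]

# Option 2: replace invalid values with the median of the valid ones.
replaced = income.mask(negative, dropped.median())

# Option 3: replace invalid values with a constant (here 0).
zeroed = income.mask(negative, 0.0)
```

Removing rows changes the length of the data, while the two replacement options keep every row but alter the distribution, so it is worth checking the summary statistics again afterwards.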