Data Manipulation using pandas
Logical Inconsistency
Another type of issue you may encounter in data is logical inconsistency. For instance, suppose you have a column of totals that is supposed to equal the sum of several other columns, but the sum doesn't match the recorded total. Or suppose you have people's ages 30, 16, 46, 23, and the next value is 1986. Most likely, that person entered their birth year instead of their age.
Regardless of where the issue comes from, you need to decide whether to keep all the observations by correcting the problematic ones, or to remove them. In the last example, you can calculate the age of the person born in 1986; but if the totals don't match their component values, you may consider removing those rows. If you remove data, make sure that you delete only an insignificant share of it, and that the removed rows are distributed uniformly.
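The correction described above (turning a birth year back into an age) can be sketched as follows. This is only an illustration: the survey year and the cutoff used to detect a birth year are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical data: the last entry looks like a birth year, not an age
ages = pd.Series([30, 16, 46, 23, 1986], name="age")

SURVEY_YEAR = 2020  # assumption: the year the answers were collected
MAX_AGE = 120       # assumption: anything above this is treated as a birth year

# Flag the values that cannot be plausible ages
looks_like_year = ages > MAX_AGE

# Keep plausible ages as they are; replace flagged values with (survey year - birth year)
fixed = ages.where(~looks_like_year, SURVEY_YEAR - ages)
print(fixed.tolist())
```

With a different survey year the corrected value changes accordingly, so this parameter should come from what you know about how the data was collected.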
For example, the dataset has the columns 'hhpera', 'hhperb1', 'hhperb2', 'hhperd1', 'hhperd2', 'hhpere1', 'hhpere2', 'hhperf1', 'hhperf2', 'hhperg1', 'hhperg2', 'hhperh1', 'hhperh2', each representing the number of males/females in a specific age range. There is also a column 'hhsize' representing the total number of people in the household. The sum of 'hhpera' through 'hhperh2' must therefore equal 'hhsize'. To calculate the sum across rows, set the axis parameter to 1. The columns described above have indexes 2-14, so the slice 2:15 selects them (the end of an iloc slice is exclusive).
# Importing the library
import pandas as pd
# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')
# Calculate the row-wise total of columns 2-14
print(df.iloc[:, 2:15].sum(axis=1))
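Positional slicing depends on the column order in the file. As a possible alternative, the same row-wise sum can be taken over a label range with loc, which includes both endpoints. The sketch below uses a made-up toy frame with only three of the age-range columns, not the course dataset.

```python
import pandas as pd

# Made-up toy data; only the column names are borrowed from the lesson
toy = pd.DataFrame({
    "id": [1, 2],
    "hhpera": [1, 0],
    "hhperb1": [2, 1],
    "hhperh2": [0, 3],
    "hhsize": [3, 4],
})

# Positional slice: columns 1, 2 and 3 (the end index 4 is excluded)
by_position = toy.iloc[:, 1:4].sum(axis=1)

# Label slice: both 'hhpera' and 'hhperh2' are included
by_label = toy.loc[:, "hhpera":"hhperh2"].sum(axis=1)

print(by_position.equals(by_label))
```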
Now let's compare these values with the 'hhsize' column values. The result will be either True (if they match) or False (otherwise). If we sum those logical values, we get the number of rows where the sums match.
# Importing the library
import pandas as pd
# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')
# Count the rows where the row-wise total equals 'hhsize'
matches = df.iloc[:, 2:15].sum(axis=1) == df['hhsize']
print(matches.sum())
As you can see, there are 1019 observations where the totals match the values of the respective column. Since the total number of rows is 1024, 5 rows don't match.
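Once the mismatches are counted, a natural next step is to look at the offending rows, check what share of the data they make up, and see whether they cluster in one group before dropping them. The sketch below does this on invented toy data; the 'region' column is hypothetical and stands in for whatever grouping variable your data has.

```python
import pandas as pd

# Made-up toy data; values and the 'region' column are invented for illustration
toy = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "hhpera": [1, 2, 0, 1, 3],
    "hhperb1": [2, 1, 1, 0, 1],
    "hhsize": [3, 3, 1, 2, 4],
})
parts = ["hhpera", "hhperb1"]

# True where the component columns add up to the recorded total
matches = toy[parts].sum(axis=1) == toy["hhsize"]

# Rows that violate the rule, and the share of the data they represent
bad_rows = toy[~matches]
share = (~matches).mean()

# Share of mismatches per group, to see whether they are spread evenly
share_by_region = (~matches).groupby(toy["region"]).mean()

# Keep only the consistent rows
clean = toy[matches]

print(bad_rows)
print(share)
print(share_by_region)
```

If the share is small and not concentrated in one group, removing the rows is usually a reasonable choice; otherwise it may be worth investigating why the totals disagree.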