Logical Inconsistency | Preprocessing Data: Part II
Data Manipulation using pandas
Course content

1. Preprocessing Data: Part I
2. Preprocessing Data: Part II
3. Grouping Data
4. Aggregating and Visualizing Data
5. Joining Data

Logical Inconsistency

Another type of issue you may encounter in data is logical inconsistency. For instance, suppose you have a column of totals that is supposed to be the sum of some other columns, but the sum doesn't match the respective values. Or suppose you have people's ages 30, 16, 46, 23, and the next value is 1986. Most likely, the person entered their birth year instead of their age.

Regardless of where the issue comes from, you need to decide whether to keep all the observations by correcting the problematic ones, or to remove them. For instance, in the last example you can calculate the age of the person born in 1986, but if the total values don't match the respective values, you may consider removing those rows. If you remove data, make sure that you delete only an insignificant share of it, and that the removed rows are distributed uniformly.
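
The birth-year case above can often be repaired rather than dropped. A minimal sketch, assuming a hypothetical survey year of 2020 and a hypothetical cutoff of 150 to separate plausible ages from birth years (neither value comes from the course data):

```python
import pandas as pd

# Ages from the example above; 1986 is most likely a birth year
ages = pd.Series([30, 16, 46, 23, 1986])

SURVEY_YEAR = 2020  # assumption: the year the data was collected
MAX_AGE = 150       # assumption: anything larger is treated as a birth year

# Keep plausible ages; replace the rest with survey year minus birth year
fixed_ages = ages.where(ages <= MAX_AGE, SURVEY_YEAR - ages)
print(fixed_ages.tolist())
```

The cutoff is a judgment call; on real data it is worth inspecting the flagged values before overwriting them.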

For example, the dataset contains the columns 'hhpera', 'hhperb1', 'hhperb2', 'hhperd1', 'hhperd2', 'hhpere1', 'hhpere2', 'hhperf1', 'hhperf2', 'hhperg1', 'hhperg2', 'hhperh1', 'hhperh2', each representing the number of males/females in a specific age range. There is also a column 'hhsize' representing the total number of people in the household. The sum of the columns 'hhpera' through 'hhperh2' must therefore equal 'hhsize'. To calculate the sum across each row, set the axis parameter to 1. The columns listed above have positional indexes 2 through 14, so the slice 2:15 selects them (the end of an iloc slice is exclusive).

# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')

# Calculate the row-wise total of the age-range columns
print(df.iloc[:, 2:15].sum(axis=1))
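
Positional slices are easy to get wrong by one, so it can help to select the same columns by label as a cross-check. A small sketch on a made-up three-row frame (the values, the 'id' column, and the reduced column set are hypothetical; only the 'hh...' column names come from the text above):

```python
import pandas as pd

# Toy data: hypothetical values, with only three of the age-range columns
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'hhsize': [3, 4, 2],
    'hhpera': [1, 2, 1],
    'hhperb1': [1, 1, 0],
    'hhperb2': [1, 1, 2],
})

# iloc slices by position and excludes the end: 2:5 selects positions 2, 3, 4
by_position = toy.iloc[:, 2:5].sum(axis=1)

# loc slices by label and includes the end label
by_label = toy.loc[:, 'hhpera':'hhperb2'].sum(axis=1)

# Both selections cover the same columns, so the row totals agree
print(by_position.equals(by_label))
```

On the full dataset the equivalent label slice would be df.loc[:, 'hhpera':'hhperh2'], assuming the columns appear in the file in the order listed above.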

Now let's compare these totals with the 'hhsize' column. The comparison yields a Boolean value for each row: True if the values match, False otherwise. Since True counts as 1 and False as 0 when summed, adding up these values gives the number of rows where the sums match.

# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')

# Count the rows where the column total equals hhsize
print((df.iloc[:, 2:15].sum(axis=1) == df['hhsize']).sum())

As you can see, there are 1019 observations where the totals match the values of the 'hhsize' column, which means that 5 rows (out of 1024 in total) don't match.
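
To decide between fixing and dropping, it helps to look at which rows mismatch and what share of the data they represent. A sketch on hypothetical toy data (the real file is not used here, and the column set is reduced for brevity):

```python
import pandas as pd

# Toy data: hypothetical values; the last row is deliberately inconsistent
toy = pd.DataFrame({
    'hhsize': [3, 4, 2, 5],
    'hhpera': [1, 2, 1, 2],
    'hhperb1': [1, 1, 0, 1],
    'hhperb2': [1, 1, 1, 1],
})

# True where the parts add up to the reported household size
matches = toy[['hhpera', 'hhperb1', 'hhperb2']].sum(axis=1) == toy['hhsize']

# Share of rows that would be lost by dropping the inconsistent ones
share_removed = (~matches).mean()
print(f"Rows to drop: {(~matches).sum()} ({share_removed:.1%})")

# Keep only the consistent rows
clean = toy[matches]
```

With the counts reported above, 5 of 1024 rows is roughly 0.5% of the data, which would usually count as an insignificant share; whether those rows are spread uniformly still has to be checked separately.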

Section 2. Chapter 1