Logical Inconsistency

Another type of issue you may encounter in data is logical inconsistency. For instance, suppose you have a column of total values that is supposed to be the sum of some other columns, but the totals don't match the sums of the respective values. Or suppose you have people's ages 30, 16, 46, 23, and the next value is 1986. Most likely, that person entered their birth year instead of their age.

Regardless of the origin of the issue, you need to decide whether to keep all the observations by correcting the problematic ones or to remove them. For instance, in the last example you can calculate the age of the person born in 1986, but if the total values don't match the respective sums, you may consider removing those rows. If you remove some data, make sure that you delete only an insignificant share and that the deleted rows are distributed uniformly across the dataset.
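The age correction described above can be sketched as follows. This is only an illustration: the survey year (2020) and the plausibility threshold (120) are assumptions that do not come from the dataset, so replace them with values that fit your data.

```python
import pandas as pd

# Hypothetical data: the last respondent entered a birth year instead of an age
ages = pd.Series([30, 16, 46, 23, 1986], name="age")

# Assumption: the survey was collected in 2020; replace with the real year
SURVEY_YEAR = 2020

# Assumption: any value above 120 is treated as a birth year rather than an age
MAX_PLAUSIBLE_AGE = 120
looks_like_year = ages > MAX_PLAUSIBLE_AGE

# Keep plausible ages as they are and convert suspected birth years to ages
fixed_ages = ages.where(~looks_like_year, SURVEY_YEAR - ages)
print(fixed_ages.tolist())
```

Using where keeps the original values wherever the condition holds and substitutes the computed age everywhere else, so no row is dropped.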

For example, the dataset has the columns 'hhpera', 'hhperb1', 'hhperb2', 'hhperd1', 'hhperd2', 'hhpere1', 'hhpere2', 'hhperf1', 'hhperf2', 'hhperg1', 'hhperg2', 'hhperh1', 'hhperh2', each representing the number of males or females in a specific age range. There is also a column 'hhsize' representing the total number of people in the household. The sum of the columns 'hhpera' through 'hhperh2' should therefore be equal to 'hhsize'. To calculate the sum across rows, set the axis parameter to 1. The columns listed above have indexes 2-14, so the slice 2:15 selects them (the end of the slice is excluded).


# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')

# Calculate the row-wise total of the columns at positions 2-14
print(df.iloc[:, 2:15].sum(axis=1))
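To see what the positional slice and axis=1 do without downloading the file, here is a minimal sketch on a made-up table. The column names 'id', 'males' and 'females' and all the values are hypothetical and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical miniature table; only 'hhsize' mirrors a real column name
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "hhsize": [3, 4, 2],
    "males": [1, 2, 1],
    "females": [2, 1, 1],
})

# iloc[:, 2:4] selects the columns at positions 2 and 3 (position 4 is excluded)
# axis=1 adds the values within each row, giving one total per row
row_totals = toy.iloc[:, 2:4].sum(axis=1)
print(row_totals.tolist())
```

With axis=0 (the default) the same call would instead return one total per column.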

Now let's compare these values with the 'hhsize' column. Each comparison yields True if the values match and False otherwise. Since True counts as 1 and False as 0, summing those logical values gives the number of rows where the sums match.


# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')

# Count the rows where the row-wise total equals 'hhsize'
print((df.iloc[:, 2:15].sum(axis=1) == df['hhsize']).sum())

As you can see, there are 1019 observations where the totals match the 'hhsize' column, which means that 5 rows don't match (the total number of rows is 1024).
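Before deciding what to do with the inconsistent rows, it helps to look at them and at the share they make up (5 out of 1024 is roughly 0.5%). The sketch below shows one way to do that with a boolean mask. It uses a small made-up table, so the column names 'males' and 'females' and all the values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature table; the second row is deliberately inconsistent
toy = pd.DataFrame({
    "hhsize": [3, 4, 2, 5],
    "males": [1, 2, 1, 3],
    "females": [2, 1, 1, 2],
})

# True where the row-wise total equals the stated household size
matches = toy[["males", "females"]].sum(axis=1) == toy["hhsize"]

# Invert the mask to keep only the rows that don't match
mismatched = toy[~matches]
print(mismatched)

# The mean of a boolean Series is the fraction of True values
share = (~matches).mean()
print(f"Share of inconsistent rows: {share:.1%}")
```

Inspecting the mismatched rows first shows whether they cluster in a particular group, which matters if you plan to remove them.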


Section 2. Chapter 1
