Another type of issue you may encounter in data is logical inconsistency. For instance, suppose you have a column of totals that is supposed to be the sum of several other columns, but the sum doesn't match the respective values. Or suppose you have people's ages `30, 16, 46, 23`, and the next value is `1986`. Most likely, that person entered a birth year instead of an age.

Regardless of the issue's origin, you need to decide whether to keep all the observations by correcting the problematic ones, or to remove them. For instance, in the last example you can calculate the age of the person born in 1986, but if the totals don't match the respective values, you may consider removing those rows. If you decide to remove data, make sure that you delete only an insignificant share of it and that the removed rows are distributed uniformly.
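The birth-year correction mentioned above could be sketched as follows. This is only an illustration on made-up values: the `survey_year` of 2020 and the cutoff of 120 for a plausible age are assumptions, not values taken from the course dataset.

```python
import pandas as pd

# Hypothetical data: the last entry looks like a birth year, not an age
ages = pd.Series([30, 16, 46, 23, 1986], name='age')

# Assumption: the year in which the answers were collected
survey_year = 2020

# Assumption: any value above 120 cannot be an age and is treated as a birth year
is_birth_year = ages > 120

# Keep plausible ages as they are; replace birth years with the implied age
fixed_ages = ages.where(~is_birth_year, survey_year - ages)
print(fixed_ages.tolist())
```

`Series.where` keeps the values where the condition is `True` and takes the replacement elsewhere, so only the suspicious entry is changed.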

For example, there are `'hhpera', 'hhperb1', 'hhperb2', 'hhperd1', 'hhperd2', 'hhpere1', 'hhpere2', 'hhperf1', 'hhperf2', 'hhperg1', 'hhperg2', 'hhperh1', 'hhperh2'` columns, each representing the number of males or females in a specific age range. There is also a column `'hhsize'` representing the total number of people in the household. The sum of the columns `'hhpera'` through `'hhperh2'` should therefore be equal to `'hhsize'`. To calculate the sum across each row, set the `axis` parameter to `1`. The columns described above have indexes 2-14.

```python
# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')

# Calculate the row-wise total of the columns with indexes 2-14
print(df.iloc[:, 2:15].sum(axis=1))
```

Now let's compare these values with the `'hhsize'` column values. Each comparison yields `True` if the values match and `False` otherwise. Since `True` counts as 1 when summed, summing these Boolean values gives the number of rows where the sums match.

```python
# Importing the library
import pandas as pd

# Reading the file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/f2947b09-5f0d-4ad9-992f-ec0b87cd4b3f/data2.csv')

# Count the rows where the row-wise total equals the 'hhsize' value
print((df.iloc[:, 2:15].sum(axis=1) == df['hhsize']).sum())
```

As you can see, there are 1019 observations where the totals match the `'hhsize'` value, which means that 5 rows (out of 1024 in total) don't match.
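Once the count is known, a natural next step is to look at the mismatched rows and check how large a share of the data they make up before removing them. The sketch below uses a small made-up DataFrame, and the column names `'males'` and `'females'` are hypothetical stand-ins for the `'hhper…'` columns. In the course dataset, 5 rows out of 1024 is roughly 0.5% of the data.

```python
import pandas as pd

# Hypothetical toy data: two component columns and a reported total
toy = pd.DataFrame({
    'males':   [2, 1, 3, 0],
    'females': [1, 2, 1, 2],
    'hhsize':  [3, 3, 5, 2],
})

# True where the components add up to the reported total
matches = toy[['males', 'females']].sum(axis=1) == toy['hhsize']

# Rows that fail the check, kept aside for inspection
mismatched = toy[~matches]
print(mismatched)

# Share of rows that would be lost if the mismatches were dropped
share_removed = (~matches).mean()
print(share_removed)

# Keep only the consistent rows
cleaned = toy[matches]
```

Inspecting `mismatched` before dropping anything helps to judge whether the problematic rows are scattered across the data or concentrated in one group.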

This course covers intermediate topics in pandas, a must-have tool for every data analyst. During the course, you will learn how to prepare data for further work and how to group it using different techniques. You will also learn basic data visualization and get acquainted with data joining.

Data received from different sources can be messy, and before using it, you must make sure it is in a usable form. In the first section, you will learn what data preprocessing is and will deal with some logical inconsistencies.

Missing or NA values, outliers, and inconsistencies are other types of problematic data. Throughout the second section of the course, you will learn how to deal with such issues.

As a data analyst, you will need to draw compact conclusions based on large amounts of data. In order to achieve that, you need to understand the data grouping idea and how to apply it to examples.

Sometimes one built-in function is not enough to draw a complete conclusion, so you need to use something more complex. This section will teach you how to apply multiple functions while grouping. Also, you will learn to visualize data using the pandas library only and will be acquainted with the main plot types.

As mentioned before, sometimes you may need to work with data received from multiple sources. This section will teach you how to join two dataframes using different techniques.

Logical Inconsistency