Data Manipulation using pandas

Another type of issue you may encounter in data is logical inconsistency. For instance, suppose you have a column of totals that is supposed to equal the sum of some other columns, but the sum doesn't match the respective values. Or suppose you have people's ages `30, 16, 46, 23`, and the next value is `1986`. Most likely, that person entered their birth year instead of their age.

Regardless of where the issue comes from, you need to decide whether to keep all the observations by correcting the problematic ones, or to remove them. For instance, in the last example you can calculate the age of the person born in 1986, but if the totals don't match the respective values, you may consider removing those rows. If you remove data, make sure that the removed rows make up an insignificant share of the dataset and are distributed uniformly across it.
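As an illustration of the "update" option, here is a minimal sketch of turning a birth year back into an age. The survey year, the cutoff for a plausible age, and the variable names are assumptions made for this example; they are not part of the course data.

```python
import pandas as pd

# Hypothetical survey answers: the last respondent entered a birth year.
ages = pd.Series([30, 16, 46, 23, 1986], name="age")

# Assumed year in which the survey was taken; use the real one for your data.
SURVEY_YEAR = 2020

# Assumed cutoff: any value above it is treated as a birth year, not an age.
MAX_PLAUSIBLE_AGE = 120
is_birth_year = ages > MAX_PLAUSIBLE_AGE

# Keep valid ages unchanged and replace birth years with the computed age.
fixed_ages = ages.where(~is_birth_year, SURVEY_YEAR - ages)
print(fixed_ages.tolist())  # [30, 16, 46, 23, 34]
```

`Series.where` keeps the values where the condition is `True` and substitutes the second argument elsewhere, so only the flagged entry changes.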

For example, there are columns `'hhpera', 'hhperb1', 'hhperb2', 'hhperd1', 'hhperd2', 'hhpere1', 'hhpere2', 'hhperf1', 'hhperf2', 'hhperg1', 'hhperg2', 'hhperh1', 'hhperh2'`, each of them representing the number of males/females in a specific age range. There is also a column `'hhsize'` representing the total number of people in the household. The sum of `'hhpera'` through `'hhperh2'` must therefore be equal to `'hhsize'`. To calculate the sum across rows, set the `axis` parameter to `1`. The columns described above have indexes 2-14.
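The row-wise sum can be sketched as follows. The DataFrame below is a small made-up stand-in, not the course dataset: the two leading columns (`'hh_id'`, `'region'`) and all values are hypothetical, chosen only so that the age-group columns sit at positions 2-14.

```python
import pandas as pd

age_cols = ['hhpera', 'hhperb1', 'hhperb2', 'hhperd1', 'hhperd2',
            'hhpere1', 'hhpere2', 'hhperf1', 'hhperf2', 'hhperg1',
            'hhperg2', 'hhperh1', 'hhperh2']

# Made-up counts for three households (one row per household).
rows = [
    [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
]
df = pd.DataFrame(rows, columns=age_cols)

# Hypothetical leading columns so the age groups start at position 2.
df.insert(0, 'region', ['N', 'S', 'N'])
df.insert(0, 'hh_id', [101, 102, 103])
df['hhsize'] = [4, 5, 3]

# iloc slicing excludes the end position, so 2:15 selects positions 2-14.
row_totals = df.iloc[:, 2:15].sum(axis=1)
print(row_totals.tolist())  # [4, 4, 3]
```

Selecting by name with `df[age_cols].sum(axis=1)` gives the same result and does not break if the column order changes.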

Now let's compare these sums with the `'hhsize'` column values. The result for each row is either `True` (if the values match) or `False` (otherwise). If we sum those logical values, we get the number of rows where the sums match.
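The comparison can be sketched on a small made-up table. The values are illustrative and only three of the age-group columns are included to keep the example short; this is not the course dataset.

```python
import pandas as pd

# Hypothetical households; the last one has an inconsistent 'hhsize'.
df = pd.DataFrame({
    'hhpera':  [1, 0, 2, 1],
    'hhperb1': [1, 2, 0, 1],
    'hhperb2': [0, 1, 1, 1],
    'hhsize':  [2, 3, 3, 4],
})
part_cols = ['hhpera', 'hhperb1', 'hhperb2']

# Element-wise comparison yields a boolean Series: True where the sums match.
matches = df[part_cols].sum(axis=1) == df['hhsize']
print(matches.tolist())  # [True, True, True, False]

# True counts as 1 and False as 0, so the sum is the number of matching rows.
n_matching = matches.sum()
print(n_matching)            # 3
print(len(df) - n_matching)  # 1
```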

As you can see, there are 1019 observations where the total matches the sum of the respective columns, which means that 5 rows (out of 1024 in total) don't match.
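Before removing mismatched rows, it may help to check how large their share is and where they fall. The sketch below does this on made-up data; the `'region'` column and all values are hypothetical, used only to show one way of checking whether the removed rows cluster in one group.

```python
import pandas as pd

# Hypothetical households with one inconsistent row (index 3).
df = pd.DataFrame({
    'region':  ['N', 'S', 'N', 'S', 'N'],
    'hhpera':  [1, 0, 2, 1, 3],
    'hhperb1': [1, 2, 0, 1, 0],
    'hhsize':  [2, 2, 2, 3, 3],
})
matches = df[['hhpera', 'hhperb1']].sum(axis=1) == df['hhsize']

# The mean of the inverted boolean mask is the share of inconsistent rows.
mismatch_share = (~matches).mean()
print(mismatch_share)  # 0.2

# Inspect which groups the inconsistent rows belong to before dropping them.
print(df.loc[~matches, 'region'].value_counts())

# Keep only consistent rows and renumber the index.
clean = df[matches].reset_index(drop=True)
print(len(clean))  # 4
```

In the course example the share would be 5 out of 1024 rows, which is under half a percent.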
