Handling Missing Values

In real-world data analysis, it's common to encounter datasets with missing or incomplete information. Missing values can distort summary statistics and propagate through calculations, so it's important to identify them and decide how to handle them.

Identifying Missing Values

Before addressing missing data, it's essential to recognize how missing values are represented in NumPy arrays. Typically, NumPy uses the special floating-point value numpy.nan ("not a number") to denote missing or undefined data.

To identify and locate missing values within a NumPy array, you can use the numpy.isnan() function:
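A minimal sketch of this check, assuming a small made-up array named arr (the name and values are illustrative only):

```python
import numpy as np

# Hypothetical 1D array with two missing entries
arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# Boolean mask: True wherever the element is NaN
mask = np.isnan(arr)
print(mask)

# Indices of the missing values
missing_idx = np.where(mask)[0]
print(missing_idx)

# Total number of missing values (True counts as 1)
print(mask.sum())
```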

This function takes an array as its argument and returns a boolean array of the same shape, with True marking every position that holds numpy.nan.

Dealing with Missing Values

Once you've identified missing values, you can choose from several strategies to handle them. We will discuss two of the most common ones:

1. Removing Missing Values

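You can remove missing values by applying boolean indexing. In a 1D array, the inverted mask selects the non-missing elements directly. In a 2D array, you first reduce the mask along an axis to find the rows or columns that contain at least one missing value, then keep the remaining ones.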
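One way this might look, assuming made-up arrays named arr and table. Reducing the 2D mask with .any() is one possible approach to dropping whole rows or columns, not the only one:

```python
import numpy as np

# Hypothetical 1D array: keep only the non-missing elements
arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
clean = arr[~np.isnan(arr)]
print(clean)

# Hypothetical 2D array with one missing value in row 0, column 2
table = np.array([[1.0, 2.0, np.nan],
                  [4.0, 5.0, 6.0]])

# Drop every row that contains at least one NaN
rows_kept = table[~np.isnan(table).any(axis=1)]
print(rows_kept)

# Drop every column that contains at least one NaN
cols_kept = table[:, ~np.isnan(table).any(axis=0)]
print(cols_kept)
```

Indexing a 2D array directly with the element-wise mask, as in table[~np.isnan(table)], returns a flattened 1D array, which is why the mask is reduced per row or per column first.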

Note

The tilde (~) is the bitwise NOT operator. Applied to a boolean array, it inverts each element, turning True into False and vice versa, so all non-missing values are marked True in the resulting boolean array.

2. Filling Missing Values

You can replace missing values with a specific value. The mean or median of the non-missing values is most commonly used for this purpose. They can be calculated using the numpy.nanmean() and numpy.nanmedian() functions, respectively.
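A short sketch of both options, assuming a made-up array named arr. The copies are only there so that the two fills can be compared side by side:

```python
import numpy as np

# Hypothetical array with two missing entries
arr = np.array([1.0, np.nan, 2.0, np.nan, 6.0])

# Statistics computed over the non-missing values only
mean_value = np.nanmean(arr)      # (1 + 2 + 6) / 3
median_value = np.nanmedian(arr)  # middle of 1, 2, 6

# Fill with the mean
filled_mean = arr.copy()
filled_mean[np.isnan(filled_mean)] = mean_value
print(filled_mean)

# Fill with the median
filled_median = arr.copy()
filled_median[np.isnan(filled_median)] = median_value
print(filled_median)
```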

Boolean indexing is then used to replace every missing value with the calculated statistic.

When dealing with higher-dimensional arrays, both of these functions by default (axis=None) calculate their statistic over all non-missing values of the flattened array. You can set the axis parameter to calculate the statistic along a specific axis instead. For a 2D array, axis=0 gives one value per column and axis=1 gives one value per row.
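The effect of the axis parameter can be sketched on a small made-up 2D array named data:

```python
import numpy as np

# Hypothetical 2 x 3 array with one missing value per row
data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 6.0, np.nan]])

# Default (axis=None): one mean over all non-missing values
overall = np.nanmean(data)
print(overall)

# axis=0: one mean per column
per_column = np.nanmean(data, axis=0)
print(per_column)

# axis=1: one mean per row
per_row = np.nanmean(data, axis=1)
print(per_row)

# numpy.nanmedian accepts the same axis parameter
per_row_median = np.nanmedian(data, axis=1)
print(per_row_median)
```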

In case you want to explore all parameters of these functions, you can refer to their documentation: numpy.nanmean, numpy.nanmedian.

Task

temperature_data is a 2D array of daily temperatures in two cities for three days, with one row per city. Your task is the following:

  1. Use the correct function to calculate the mean of non-missing values for every city.
  2. Specify the second argument (axis) correctly to calculate the mean for each row separately.
  3. Replace missing values with average_temperatures using boolean indexing.
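
One possible approach to these steps is sketched below. The temperature values are made up, and the row_idx helper is a hypothetical name introduced here so that each missing value receives the mean of its own row; the exercise itself may expect a different form.

```python
import numpy as np

# Made-up data: 2 cities (rows) x 3 days (columns)
temperature_data = np.array([[15.0, np.nan, 17.0],
                             [20.0, 22.0, np.nan]])

# Steps 1 and 2: mean of the non-missing values for each row (city)
average_temperatures = np.nanmean(temperature_data, axis=1)
print(average_temperatures)

# Step 3: boolean mask of the missing entries
mask = np.isnan(temperature_data)

# Row index of every missing entry, in the same row-major order
# that boolean-mask assignment uses
row_idx = np.nonzero(mask)[0]

# Fill each missing entry with the mean of its own row
temperature_data[mask] = average_temperatures[row_idx]
print(temperature_data)
```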
