## Handling Missing Values

In real-world data analysis, it's common to encounter datasets with missing or incomplete information. Missing values can significantly impact the quality and reliability of your analysis, so it's important to identify and handle them.

### Identifying Missing Values

Before addressing missing data, it's essential to recognize how missing values are represented in NumPy arrays. Typically, NumPy uses the special value `numpy.nan` to denote missing or undefined data.

To identify and locate missing values within a NumPy array, you can use the `numpy.isnan()` function:
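A minimal sketch of this check is shown here. The sample values and the variable names (`array`, `missing_mask`, and so on) are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: two of the five entries are missing
array = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# Boolean mask with the same shape as the input: True wherever the value is NaN
missing_mask = np.isnan(array)
print(missing_mask)

# The mask can also be used to count and locate the missing entries
missing_count = int(missing_mask.sum())
missing_positions = np.flatnonzero(missing_mask)
print(missing_count, missing_positions)
```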

This function takes an array as its argument and returns a boolean array of the same shape, in which `True` marks each `numpy.nan` value.

### Dealing with Missing Values

Once you've identified missing values, you can choose from several strategies to handle them. We will discuss two of the most common ones:

1. Removing Missing Values

You can remove rows or columns containing missing values by applying boolean indexing:
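The following sketch shows one way this could look, using made-up data. For a 1D array the inverted mask drops individual elements. For a 2D array, the example assumes you first reduce the mask with `any()` along an axis to decide which whole rows or columns to drop.

```python
import numpy as np

# 1D case: keep only the elements that are not NaN
values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
cleaned = values[~np.isnan(values)]
print(cleaned)

# 2D case (hypothetical 3x2 table): one row contains a missing value
table = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])

# True for every row that contains at least one NaN
rows_with_nan = np.isnan(table).any(axis=1)
complete_rows = table[~rows_with_nan]
print(complete_rows)

# True for every column that contains at least one NaN
cols_with_nan = np.isnan(table).any(axis=0)
complete_cols = table[:, ~cols_with_nan]
print(complete_cols)
```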

Note

The tilde (`~`) is the bitwise NOT operator. Applied to a boolean array, it turns every `True` into `False` and vice versa, so all non-missing values are marked as `True` in the resulting array.

2. Filling Missing Values

You can replace missing values with a specific value. The mean or the median of the non-missing values is most commonly used for this purpose. They can be calculated using the `numpy.nanmean()` and `numpy.nanmedian()` functions respectively:
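As an illustration, here is a short sketch with made-up readings. The array is copied before filling so the original data stays untouched; that is a choice made for this example, not a requirement.

```python
import numpy as np

# Hypothetical readings with two missing entries
readings = np.array([10.0, np.nan, 30.0, np.nan, 80.0])

# Statistics computed over the non-missing values only
mean_value = np.nanmean(readings)      # (10 + 30 + 80) / 3
median_value = np.nanmedian(readings)  # middle value of 10, 30, 80

# Fill a copy with the mean
filled_with_mean = readings.copy()
filled_with_mean[np.isnan(filled_with_mean)] = mean_value
print(filled_with_mean)

# Fill another copy with the median
filled_with_median = readings.copy()
filled_with_median[np.isnan(filled_with_median)] = median_value
print(filled_with_median)
```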

Boolean indexing is then used to replace every missing value with the calculated value.

When dealing with higher-dimensional arrays, both of these functions calculate their respective statistics over all non-missing values of the flattened array by default (`axis=None`). You can set the `axis` parameter yourself to calculate the statistics along the specified axis:
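The effect of the `axis` parameter can be sketched on a small, made-up 2D array:

```python
import numpy as np

# Hypothetical 2x3 array with one missing value in each row
data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan]])

# Default (axis=None): a single statistic over all non-missing values
overall_mean = np.nanmean(data)

# axis=0: one value per column
column_means = np.nanmean(data, axis=0)

# axis=1: one value per row
row_means = np.nanmean(data, axis=1)
row_medians = np.nanmedian(data, axis=1)

print(overall_mean)
print(column_means)
print(row_means)
print(row_medians)
```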

To explore all parameters of these functions, refer to their documentation: `numpy.nanmean`, `numpy.nanmedian`.

`temperature_data` is a 2D array of daily temperatures in two cities for three days. Your task is the following:
3. Replace missing values with `average_temperatures` using boolean indexing.