Handling Missing Values
In real-world data analysis, it's common to encounter datasets with missing or incomplete information. Missing values can significantly impact the quality and reliability of your analysis, so it’s important to identify such values and deal with them.
Identifying Missing Values
Before addressing missing data, it's essential to recognize how missing values are represented in NumPy arrays. Typically, NumPy uses the special value numpy.nan to denote missing or undefined data.
To identify and locate missing values within a NumPy array, you can use the numpy.isnan() function. This function takes an array as its argument and returns a boolean array of the same shape, with True identifying each missing value.
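As a minimal sketch of locating missing values, the example below builds a boolean mask for a small array. The array contents and the variable names (values, missing_mask, missing_positions) are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: one missing reading encoded as np.nan
values = np.array([1.5, np.nan, 3.0, 4.5])

# np.isnan returns a boolean array with the same shape as its input
missing_mask = np.isnan(values)
print(missing_mask)        # [False  True False False]

# np.where on a boolean array returns the indices of the True entries
missing_positions = np.where(missing_mask)[0]
print(missing_positions)   # [1]

# True counts as 1 when summed, so this is the number of missing values
print(missing_mask.sum())  # 1
```

The mask can be reused later for removing or replacing the missing entries.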
Dealing with Missing Values
Once you've identified missing values, you can choose from several strategies to handle them. We will discuss two of the most common ones:
1. Removing Missing Values
You can remove missing values, or whole rows or columns that contain them, by applying boolean indexing with an inverted mask of the missing entries.
The tilde (~) symbol is the bitwise NOT operator. Applied to a boolean array, it turns True into False and vice versa, so all non-missing values will be marked as True in the boolean array.
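The removal step could look roughly like the sketch below. The sample arrays and variable names are hypothetical, and the second part (dropping whole rows with any(axis=1)) is one possible way to apply the same idea to a 2D array.

```python
import numpy as np

# Hypothetical 1D data with two missing entries
values = np.array([1.5, np.nan, 3.0, np.nan, 4.5])

# ~np.isnan(values) is True wherever a value is present
clean_values = values[~np.isnan(values)]
print(clean_values)  # [1.5 3.  4.5]

# For a 2D array, drop every row that contains at least one NaN
table = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])
rows_with_nan = np.isnan(table).any(axis=1)
clean_rows = table[~rows_with_nan]
print(clean_rows)    # keeps the first and third rows
```

Note that boolean indexing returns a new array, so the original data stays unchanged.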
2. Filling Missing Values
You can replace missing values with specific values. The mean or median of the non-missing values is most commonly used for this purpose. They can be calculated using the numpy.nanmean() and numpy.nanmedian() functions, respectively.
Boolean indexing is then used to replace every missing value with the calculated value.
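A minimal sketch of this filling strategy is shown below, assuming a small 1D array of made-up readings. The variable names are illustrative only, and the arrays are copied first so the original data is preserved.

```python
import numpy as np

# Hypothetical readings with two missing entries
values = np.array([10.0, np.nan, 12.0, 20.0, np.nan])

# Both functions ignore NaN entries when computing the statistic
mean_value = np.nanmean(values)      # (10 + 12 + 20) / 3 = 14.0
median_value = np.nanmedian(values)  # median of [10, 12, 20] = 12.0

# Work on copies so the original array keeps its NaN markers
filled_with_mean = values.copy()
filled_with_mean[np.isnan(filled_with_mean)] = mean_value
print(filled_with_mean)    # [10. 14. 12. 20. 14.]

filled_with_median = values.copy()
filled_with_median[np.isnan(filled_with_median)] = median_value
print(filled_with_median)  # [10. 12. 12. 20. 12.]
```

The median is often preferred when the data contains outliers, since a single extreme value (here 20.0) pulls the mean but not the median.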
When dealing with higher-dimensional arrays, both of these functions calculate their respective statistics for all non-missing values in a flattened array by default (axis=None). You can set the axis parameter yourself to calculate the statistics along the specified axis.
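To illustrate the effect of the axis parameter, here is a short sketch on a made-up 2x3 array. The data and variable names are assumptions chosen so the results are easy to check by hand.

```python
import numpy as np

# Hypothetical array with 2 rows and 3 columns
data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 6.0, np.nan]])

# Default (axis=None): one statistic over all non-missing values
overall = np.nanmean(data)
print(overall)     # 3.5, i.e. (1 + 3 + 4 + 6) / 4

# axis=0: one mean per column, computed down the rows
per_column = np.nanmean(data, axis=0)
print(per_column)  # [2.5 6.  3. ]

# axis=1: one mean per row, computed across the columns
per_row = np.nanmean(data, axis=1)
print(per_row)     # [2. 5.]
```

numpy.nanmedian() accepts the same axis argument and behaves analogously.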
temperature_data is a 2D array of daily temperatures in two cities over three days. Your task is the following:
- Use the correct function to calculate the mean of non-missing values for every city.
- Specify the second keyword argument correctly to calculate the mean for each row separately.
- Replace missing values with average_temperatures using boolean indexing.