Handling Missing Data

Missing data is a common issue in real-world datasets. If it is not handled deliberately, it can distort summary statistics and model estimates and lead to misleading results. In R, a missing value is represented by NA.

Detecting Missing Values

The first step is to check where and how much data is missing in your dataset.

is.na(df)              # returns a logical matrix of TRUE/FALSE
sum(is.na(df))         # total number of missing values
colSums(is.na(df))     # missing values per column

This gives a clear idea of which columns have missing data and how serious the issue is.
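As a minimal sketch, the checks above can be tried on a small made-up data frame. The `cars` data frame, its values, and the `km_driven` column are hypothetical and used only for illustration. `colMeans()` and `complete.cases()` are base R functions not used in this chapter; they are added here to show the share of missing values per column and the number of fully complete rows.

```r
# Hypothetical toy data; the values are invented for illustration
cars <- data.frame(
  selling_price = c(450000, NA, 320000, NA, 610000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel"),
  km_driven     = c(70000, 50000, 120000, 30000, 45000)
)

total_missing   <- sum(is.na(cars))           # NA cells in the whole table
missing_per_col <- colSums(is.na(cars))       # NA cells in each column
share_per_col   <- colMeans(is.na(cars))      # fraction of NA cells in each column
complete_rows   <- sum(complete.cases(cars))  # rows with no NA at all

print(missing_per_col)
print(round(share_per_col, 2))
```

Looking at the share rather than the raw count makes it easier to compare columns when deciding whether to drop rows or fill values.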

Removing Missing Values

Sometimes the simplest way to handle missing data is to remove rows that contain any NA values. This ensures the dataset is clean, but it can also result in significant data loss if many rows are affected.

Base R

The na.omit() function removes every row that contains at least one missing value.

df_clean <- na.omit(df)
sum(is.na(df_clean))   # 0 once all incomplete rows are gone

dplyr

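The same task can be done with drop_na(), which comes from the tidyr package and fits into a dplyr pipeline.

library(dplyr)
library(tidyr)   # provides drop_na()

df_clean <- df %>%
  drop_na()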

This approach is simple and works well when the amount of missing data is small, but may not be ideal if many rows are removed in the process.
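Before dropping rows, it can help to measure how much data would be lost, and to consider dropping only on the columns that matter for the analysis. The following base R sketch illustrates both ideas on hypothetical toy data (the `cars` data frame and its values are invented). With tidyr, passing a column name to drop_na(), for example drop_na(selling_price), restricts the check in a similar way.

```r
# Hypothetical toy data; the values are invented for illustration
cars <- data.frame(
  selling_price = c(450000, NA, 320000, NA, 610000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel")
)

# Drop every row that has at least one NA, then measure the loss
cars_complete <- na.omit(cars)
rows_lost  <- nrow(cars) - nrow(cars_complete)
share_lost <- rows_lost / nrow(cars)

# Drop only rows where selling_price is missing; rows with NA in fuel are kept
cars_priced <- cars[!is.na(cars$selling_price), ]

cat("Rows lost with na.omit():", rows_lost, "of", nrow(cars), "\n")
cat("Rows kept when only selling_price must be present:", nrow(cars_priced), "\n")
```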

Replacing Missing Values

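Instead of dropping rows, you can use imputation, where missing values are replaced with estimates. This preserves the dataset size and avoids discarding the other values in those rows. A common strategy for numeric variables is to replace missing values with the column mean. Keep in mind that filling many values with a single number leaves the mean unchanged but shrinks the column's variance, so it works best when only a small share of values is missing.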

Base R

You can use logical indexing with is.na() to find missing values and assign them the mean of the column.

df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)

dplyr

You can also handle imputation by using ifelse() inside of mutate().

df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price))
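The mean is only one possible fill value. As a sketch under stated assumptions, the example below uses the median instead, which is less sensitive to outliers, and applies it to every numeric column at once. The helper name `impute_median()`, the `cars` data frame, and the `km_driven` column are hypothetical. The last two lines compare the standard deviation before and after filling, to show how imputing with a constant reduces the spread.

```r
# Hypothetical toy data; the values are invented for illustration
cars <- data.frame(
  selling_price = c(450000, NA, 320000, NA, 610000),
  km_driven     = c(70000, 50000, NA, 30000, 45000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel")
)

# Hypothetical helper: replace NA in a numeric vector with that vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

numeric_cols <- sapply(cars, is.numeric)   # TRUE for numeric columns only
cars_imputed <- cars
cars_imputed[numeric_cols] <- lapply(cars[numeric_cols], impute_median)

# Spread of selling_price before and after imputation
sd_before <- sd(cars$selling_price, na.rm = TRUE)
sd_after  <- sd(cars_imputed$selling_price)
```

The non-numeric `fuel` column is left untouched here, so its missing value still needs separate handling.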

Filling Missing Values in Categorical Columns

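For categorical variables (character or factor columns), missing values are often replaced with a fixed placeholder such as "Unknown". The base R assignment below works directly on character columns; a factor column needs "Unknown" added to its levels first, or a conversion to character.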

Base R

df$fuel[is.na(df$fuel)] <- "Unknown"

dplyr

The replace_na() function from the tidyr package provides a cleaner way to fill missing values inside mutate().

df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))

Using an explicit placeholder handles missing values consistently and keeps those rows visible as their own category in reports and models.
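Factor columns need one extra step in base R: assigning a value that is not an existing level produces NA with a warning. The sketch below, using a hypothetical toy vector, adds "Unknown" to the levels before filling.

```r
# Hypothetical toy column stored as a factor
fuel <- factor(c("Petrol", "Diesel", NA, "Petrol"))

# Register the placeholder as a valid level, then assign it to the NA positions
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(table(fuel))
```

Converting the column with as.character() before filling is an alternative when the factor structure is not needed.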

