Learn Handling Missing Data | Data Manipulation and Cleaning
Data Analysis with R

Handling Missing Data

Missing data is a common issue in real-world datasets. It can affect analysis accuracy and lead to misleading results if not properly addressed.

Detecting Missing Values

The first step is to check where and how much data is missing in your dataset.

is.na(df)              # returns a logical matrix of TRUE/FALSE
sum(is.na(df))         # total number of missing values
colSums(is.na(df))     # missing values per column

This gives a clear idea of which columns have missing data and how serious the issue is.
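As a minimal sketch, the same checks can be tried on a small made-up data frame. The column names selling_price and fuel follow the examples later in this chapter; km_driven and all of the values are hypothetical. colMeans() is added here as one way to express missingness as a share of rows, which is often easier to judge than a raw count.

```r
# Hypothetical toy data; the values are made up for illustration
df <- data.frame(
  selling_price = c(450000, NA, 300000, 525000, NA),
  km_driven     = c(70000, 50000, NA, 46000, 120000),
  fuel          = c("Diesel", "Petrol", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

na_total   <- sum(is.na(df))        # total number of missing cells
na_per_col <- colSums(is.na(df))    # count of NAs in each column
na_share   <- colMeans(is.na(df))   # proportion of rows missing, per column

print(na_total)
print(na_per_col)
print(round(na_share, 2))
```

A column with a high share of missing values may call for a different treatment than one with only a handful of gaps.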

Removing Missing Values

Sometimes the simplest way to handle missing data is to remove every row that contains at least one NA value. The result has no missing values, but the data loss can be significant if many rows are affected.

Base R

The na.omit() function removes all rows with missing values from the dataset.

df_clean <- na.omit(df)    # keep only rows with no NA in any column
sum(is.na(df_clean))       # should now be 0

dplyr

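The same task can be done with drop_na(), which comes from the tidyr package rather than dplyr itself; both are attached by library(tidyverse).

library(dplyr)   # provides %>%
library(tidyr)   # provides drop_na()

df_clean <- df %>%
  drop_na()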

This approach is simple and works well when the amount of missing data is small, but may not be ideal if many rows are removed in the process.
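Before dropping anything, it can help to measure how many rows would be lost. The sketch below uses complete.cases() from base R for that, on a made-up data frame (the values are hypothetical). It also shows one way to drop rows only when a specific column is missing, which usually discards less data.

```r
# Hypothetical toy data; the values are made up for illustration
df <- data.frame(
  selling_price = c(450000, NA, 300000, 525000, NA),
  fuel          = c("Diesel", "Petrol", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

complete_rows <- complete.cases(df)    # TRUE where a row has no NA
rows_lost     <- sum(!complete_rows)   # rows that na.omit() would drop
share_lost    <- mean(!complete_rows)  # the same figure as a proportion

df_clean <- na.omit(df)

# Drop rows only when one specific column is missing
df_price_known <- df[!is.na(df$selling_price), ]

print(rows_lost)
print(share_lost)
print(nrow(df_clean))
print(nrow(df_price_known))
```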

Replacing Missing Values

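Instead of dropping rows, an alternative is imputation, where missing values are replaced with estimates. This preserves the dataset size and avoids discarding the information in the other columns of an incomplete row. A common strategy for numeric variables is to replace missing values with the column mean. Keep in mind that mean imputation leaves the column mean unchanged but shrinks its variance, so it can distort relationships between variables when many values are missing.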

Base R

You can use logical indexing with is.na() to find missing values and assign them the mean of the column.

df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)

dplyr

The same imputation can be written with ifelse() inside mutate().

df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price))
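To see what mean imputation does to a column, here is a minimal base R sketch on a made-up price vector (the values are hypothetical). It applies the same ifelse() logic and then compares the standard deviation before and after filling.

```r
# Hypothetical prices; the values are made up for illustration
price <- c(100, 200, NA, 400, NA, 300)

price_mean <- mean(price, na.rm = TRUE)   # mean of the observed values only

# Mean imputation: keep observed values, fill NAs with the mean
imputed <- ifelse(is.na(price), price_mean, price)

sd_before <- sd(price, na.rm = TRUE)   # spread of the observed values
sd_after  <- sd(imputed)               # spread after filling with the mean

print(imputed)
print(c(before = sd_before, after = sd_after))
```

The mean of the filled column equals the mean of the observed values by construction, while the standard deviation gets smaller because every filled value sits exactly at the center.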

Filling Missing Values in Categorical Columns

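For categorical variables (character or factor columns), missing values are often replaced with a fixed placeholder such as "Unknown". The examples here assume a character column; in a factor column, "Unknown" must already be one of the levels, otherwise the assignment fails.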

Base R

df$fuel[is.na(df$fuel)] <- "Unknown"

dplyr

The replace_na() function from the tidyr package fills missing values in a single call.

df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))   # replace_na() comes from tidyr

This approach ensures that missing values are handled consistently and the column remains valid for reporting or modeling.
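If the column is stored as a factor, one way to make the placeholder assignable is to add it as a level first. The sketch below uses a made-up fuel vector (the values are hypothetical); without the levels() step, base R would insert NA again and emit an "invalid factor level" warning.

```r
# Hypothetical fuel column stored as a factor; the values are made up
fuel <- factor(c("Diesel", "Petrol", NA, "Petrol"))

# "Unknown" must exist as a level before it can be assigned
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(table(fuel))
```

Converting the column with as.character() before filling is another option when the factor structure is not needed.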

Section 1. Chapter 10
