Handling Missing Data
Missing data is a common issue in real-world datasets. It can affect analysis accuracy and lead to misleading results if not properly addressed.
Detecting Missing Values
The first step is to check where and how much data is missing in your dataset.
is.na(df) # returns a logical matrix of TRUE/FALSE
sum(is.na(df)) # total number of missing values
colSums(is.na(df)) # missing values per column
This gives a clear idea of which columns have missing data and how serious the issue is.
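As a minimal sketch, the same checks can be run end to end on a small made-up data frame. The name cars_df and its values are hypothetical, invented here for illustration. colMeans() is an addition that is not part of the calls above; it expresses missingness as a share of rows, which is often easier to judge than a raw count.

```r
# Hypothetical example data; the values are invented for illustration
cars_df <- data.frame(
  selling_price = c(450000, NA, 120000, 300000, NA),
  fuel = c("Petrol", "Diesel", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

total_missing <- sum(is.na(cars_df))        # total NA count across the data frame
missing_per_col <- colSums(is.na(cars_df))  # NA count per column
missing_share <- colMeans(is.na(cars_df))   # fraction of rows missing per column

print(total_missing)
print(missing_per_col)
print(missing_share)
```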
Removing Missing Values
Sometimes the simplest way to handle missing data is to remove rows that contain any NA values. This ensures the dataset is clean, but it can also result in significant data loss if many rows are affected.
Base R
The na.omit() function removes all rows with missing values from the dataset.
df_clean <- na.omit(df)
sum(is.na(df_clean))
dplyr
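The same task can be done in a pipeline using the drop_na() function. Note that drop_na() comes from the tidyr package, which is loaded alongside dplyr by library(tidyverse).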
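library(dplyr) # provides the %>% pipe
library(tidyr) # provides drop_na()
df_clean <- df %>%
  drop_na() # drop every row that contains at least one NA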
This approach is simple and works well when the amount of missing data is small, but may not be ideal if many rows are removed in the process.
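Before committing to row removal, it can help to measure how many rows would be lost. The sketch below uses only base R and a hypothetical data frame (cars_df and its values are invented for illustration). It also shows complete.cases(), a base R function not mentioned above, as one way to drop rows based on a single column instead of all columns.

```r
# Hypothetical example data; the values are invented for illustration
cars_df <- data.frame(
  selling_price = c(450000, NA, 120000, 300000, NA),
  fuel = c("Petrol", "Diesel", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

rows_before <- nrow(cars_df)
cars_clean <- na.omit(cars_df)   # removes rows with an NA in any column
rows_after <- nrow(cars_clean)

# Share of rows lost by dropping every incomplete row
share_dropped <- (rows_before - rows_after) / rows_before
print(share_dropped)

# Alternative: only require selling_price to be present
cars_price_only <- cars_df[complete.cases(cars_df$selling_price), ]
print(nrow(cars_price_only))
```

If the share of dropped rows is large, restricting the check to the columns the analysis actually needs, or imputing instead, may preserve more data.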
Replacing Missing Values
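Instead of dropping rows, you can use imputation, where missing values are replaced with meaningful estimates. This preserves the dataset size and keeps the other values in each affected row. A common strategy for numeric variables is to replace missing values with the column mean. Keep in mind that mean imputation leaves the column mean unchanged but reduces its variance, so it works best when only a small share of values is missing.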
Base R
You can use logical indexing with is.na() to find missing values and assign them the mean of the column.
df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)
dplyr
You can also handle imputation by using ifelse() inside mutate().
df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price))
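To see what mean imputation does to a column, here is a small base R sketch. The helper name impute_mean and the price values are hypothetical, introduced only for this illustration. It fills the gaps and then compares the mean and standard deviation before and after.

```r
# Hypothetical helper: replace NA in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Invented example values
prices <- c(450000, NA, 120000, 300000, NA)
prices_filled <- impute_mean(prices)

mean_before <- mean(prices, na.rm = TRUE)
mean_after <- mean(prices_filled)
sd_before <- sd(prices, na.rm = TRUE)
sd_after <- sd(prices_filled)

print(prices_filled)
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(sd_before = sd_before, sd_after = sd_after))
```

The mean stays the same while the standard deviation shrinks, because every filled value sits exactly at the center of the distribution.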
Filling Missing Values in Categorical Columns
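For categorical variables (character or factor columns), missing values are often replaced with a fixed placeholder such as "Unknown". For a factor column, "Unknown" must first be added as a level (or the column converted to character), because assigning a value that is not an existing level produces NA with a warning.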
Base R
df$fuel[is.na(df$fuel)] <- "Unknown"
dplyr
The replace_na() function from the tidyr package provides a cleaner way to fill missing values.
df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))
This approach ensures that missing values are handled consistently and the column remains valid for reporting or modeling.
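When the column is stored as a factor, a sketch of the base R approach might look like the following. The fuel values are invented for illustration. The placeholder is registered as a level before the assignment so that the replacement is accepted.

```r
# Invented example values stored as a factor
fuel <- factor(c("Petrol", "Diesel", NA, "Petrol"))

# Register the placeholder as a valid level before assigning it
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(levels(fuel))
print(table(fuel))
```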