Detecting Missing, Duplicated, and Inconsistent Data
Ensuring high data quality is fundamental to any data analysis project. Poor-quality data (missing, duplicated, or inconsistent values) can mislead your analysis, produce incorrect results, and erode trust in your findings. As you work with real-world datasets, you will often encounter these issues. Understanding how to detect and address them is the first step toward reliable insights.
```r
# Sample data frame with missing and duplicated values
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Detect missing values
missing_mask <- is.na(df)
print("Missing value mask:")
print(missing_mask)

# Find duplicated rows
duplicate_rows <- duplicated(df)
print("Duplicated rows (TRUE if duplicated):")
print(duplicate_rows)

# Extract unique rows
unique_rows <- unique(df)
print("Unique rows in the data frame:")
print(unique_rows)
```
In this code, is.na() checks each entry for missing values and returns a logical matrix with the same dimensions as the data frame. The duplicated() function returns a logical vector with one element per row: TRUE marks a row that repeats an earlier row, so the first occurrence of each row stays FALSE. To get a copy of the data frame with those repeats removed, use the unique() function.
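The logical matrix from is.na() is often easier to act on once it is summarized. The following is a minimal sketch that rebuilds the same sample values so it runs on its own; the variable names missing_per_column, incomplete_rows, and all_copies are illustrative choices, not part of the lesson's code.

```r
# Rebuild the sample data frame so this block is self-contained
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(df))
print(missing_per_column)

# Select rows that contain at least one missing value
incomplete_rows <- df[!complete.cases(df), ]
print(incomplete_rows)

# duplicated() flags only the second and later copies of a row;
# combining it with fromLast = TRUE flags every copy, including the first
all_copies <- duplicated(df) | duplicated(df, fromLast = TRUE)
print(df[all_copies, ])
```

Flagging every copy, rather than only the later ones, can be useful when you want to inspect repeated rows side by side before deciding which one to keep.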
```r
# Detecting inconsistencies in categorical and numerical data

# View frequency of each name (categorical)
name_table <- table(df$name)
print("Name frequency table:")
print(name_table)

# Summarize numerical column 'score'
score_summary <- summary(df$score)
print("Score summary statistics:")
print(score_summary)
```
The table() function counts how often each distinct value occurs in a categorical column, which can help you spot unexpected categories or typos. By default it omits NA; pass useNA = "ifany" to count missing values as well. The summary() function gives an overview of a numerical column, reporting the minimum, quartiles, median, mean, and maximum, plus the number of NA values when any are present, making it easier to detect out-of-range values or anomalies.
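To make these checks concrete, here is a small sketch of how inconsistent labels and out-of-range numbers might be surfaced. It assumes a hypothetical survey data frame whose city column contains casing and whitespace variants, and it assumes valid scores lie between 0 and 100; neither the data nor that range comes from the lesson.

```r
# Hypothetical data with inconsistent category labels and out-of-range scores
survey <- data.frame(
  city = c("Paris", "paris", " Paris", "Berlin", "berlin ", NA),
  score = c(95, 88, 105, -3, 70, NA)
)

# Include NA in the frequency table so missing categories are visible
raw_counts <- table(survey$city, useNA = "ifany")
print(raw_counts)

# Normalize whitespace and letter case, then recount
clean_city <- tolower(trimws(survey$city))
clean_counts <- table(clean_city, useNA = "ifany")
print(clean_counts)

# Flag scores outside the assumed valid range of 0 to 100
out_of_range <- !is.na(survey$score) & (survey$score < 0 | survey$score > 100)
print(survey[out_of_range, ])
```

Comparing the raw and normalized tables shows how many apparent categories were only spelling variants, while the range check isolates the rows that need a closer look.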
1. What function in R is commonly used to detect missing values in a data frame?
2. Which function helps you find duplicated rows in a dataset?
3. Why is it important to identify inconsistent data before analysis?