Learn Detecting Missing, Duplicated, and Inconsistent Data | Data Quality Essentials
Working with Text, Dates, and Data Cleaning in R

Detecting Missing, Duplicated, and Inconsistent Data

Ensuring high data quality is fundamental to any data analysis project. Poor-quality data, such as missing, duplicated, or inconsistent values, can mislead your analysis, produce incorrect results, and erode trust in your findings. As you work with real-world datasets, you will often encounter these issues. Understanding how to detect and address them is the first step toward reliable insights.

    # Sample data frame with missing and duplicated values
    df <- data.frame(
      id = c(1, 2, 3, 4, 4, 5, 6, NA),
      name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
      score = c(95, 88, 92, NA, NA, 88, 95, 100)
    )

    # Detect missing values
    missing_mask <- is.na(df)
    print("Missing value mask:")
    print(missing_mask)

    # Find duplicated rows
    duplicate_rows <- duplicated(df)
    print("Duplicated rows (TRUE if duplicated):")
    print(duplicate_rows)

    # Extract unique rows
    unique_rows <- unique(df)
    print("Unique rows in the data frame:")
    print(unique_rows)

In this code, is.na() checks each entry for missing values and returns a logical matrix with the same dimensions as your data frame. The duplicated() function returns a logical vector with one element per row, marking a row as TRUE when it repeats an earlier row; the first occurrence stays FALSE. To get a version of your data frame with the repeated rows removed, use the unique() function.
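The logical results from is.na() and duplicated() are often easier to read once they are reduced to counts or used to pull out the affected rows. The following is a minimal sketch along those lines, reusing the lesson's sample data frame; the names na_per_column, rows_with_na, and all_copies are illustrative and do not come from the lesson.

```r
# Recreate the lesson's sample data frame so this block runs on its own
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Count missing values in each column (TRUE counts as 1 when summed)
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Keep only the rows that contain at least one missing value
rows_with_na <- df[!complete.cases(df), ]
print(rows_with_na)

# duplicated() flags only the second and later copies of a row.
# Combining it with fromLast = TRUE flags every copy, including the first.
all_copies <- duplicated(df) | duplicated(df, fromLast = TRUE)
print(df[all_copies, ])
```

Showing every copy of a repeated row, rather than only the later ones, can make it easier to decide which copy to keep before calling unique().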

    # Detecting inconsistencies in categorical and numerical data

    # View frequency of each name (categorical)
    name_table <- table(df$name)
    print("Name frequency table:")
    print(name_table)

    # Summarize numerical column 'score'
    score_summary <- summary(df$score)
    print("Score summary statistics:")
    print(score_summary)

The table() function displays the frequency of each unique value in a categorical column, which can help you spot unexpected categories or typos. By default it leaves NA values out of the count. The summary() function provides an overview of a numerical column, showing the minimum, maximum, mean, and quartiles, along with the number of NA values when any are present, making it easier to detect out-of-range values or anomalies.
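These two checks can be taken a step further. Passing useNA = "ifany" to table() adds NA as its own category, and an explicit range test turns the visual scan of summary() into a filter. The sketch below assumes that valid scores lie between 0 and 100, which the lesson does not state; the names name_counts and out_of_range are also illustrative.

```r
# Recreate the lesson's sample data frame so this block runs on its own
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Include NA as its own category in the frequency table
name_counts <- table(df$name, useNA = "ifany")
print(name_counts)

# Names that occur more than once may point to repeated entries
print(name_counts[name_counts > 1])

# Flag scores outside the ASSUMED valid range of 0 to 100.
# which() ignores NA comparisons, so missing scores are not flagged here.
out_of_range <- which(df$score < 0 | df$score > 100)
print(df[out_of_range, ])
```

With the sample data, no score falls outside the assumed range, so the last print shows an empty data frame; missing scores still need a separate is.na() check.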

1. What function in R is commonly used to detect missing values in a data frame?

2. Which function helps you find duplicated rows in a dataset?

3. Why is it important to identify inconsistent data before analysis?


Section 3. Chapter 1
