Detecting Missing, Duplicated, and Inconsistent Data

Working with Text, Dates, and Data Cleaning in R

Ensuring high data quality is fundamental to any data analysis project. Poor-quality data, such as missing, duplicated, or inconsistent values, can mislead your analysis, produce incorrect results, and erode trust in your findings. Real-world datasets often contain these issues, so detecting them is the first step toward reliable insights.


# Sample data frame with missing and duplicated values
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Detect missing values
missing_mask <- is.na(df)
print("Missing value mask:")
print(missing_mask)

# Find duplicated rows
duplicate_rows <- duplicated(df)
print("Duplicated rows (TRUE if duplicated):")
print(duplicate_rows)

# Extract unique rows
unique_rows <- unique(df)
print("Unique rows in the data frame:")
print(unique_rows)

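In this code, is.na() checks each entry for missing values and returns a logical matrix with the same dimensions as your data frame, where TRUE marks a missing entry. The duplicated() function returns one logical value per row: TRUE marks a row that repeats an earlier row, so the first occurrence stays FALSE and only the later copies are flagged. Here the second (4, "Dave", NA) row is flagged, because NA values in the same position are treated as matching. To get a version of your data frame with only unique rows, use the unique() function, which keeps the first occurrence of each row.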
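The logical mask is easier to act on once it is summarized. The following is a minimal sketch, not part of the original lesson, that counts missing values per column, keeps only complete rows, and lists every copy of a duplicated row. The variable names na_per_column, complete_rows, and all_copies are illustrative choices.

```r
# Same sample data frame as above, repeated so this sketch runs on its own
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Count missing values in each column (TRUE counts as 1 when summed)
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Keep only the rows that have no missing value in any column
complete_rows <- df[complete.cases(df), ]
print(complete_rows)

# duplicated() flags only the second and later copies of a row;
# combining it with fromLast = TRUE flags every copy, including the first
all_copies <- duplicated(df) | duplicated(df, fromLast = TRUE)
print(df[all_copies, ])
```

Showing every copy side by side can help you decide which one to keep before calling unique().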


# Detecting inconsistencies in categorical and numerical data

# View frequency of each name (categorical)
name_table <- table(df$name)
print("Name frequency table:")
print(name_table)

# Summarize numerical column 'score'
score_summary <- summary(df$score)
print("Score summary statistics:")
print(score_summary)

The table() function displays the frequency of each unique value in a categorical column, which can help you spot unexpected categories or typos. By default it leaves out NA values, so the missing name is not counted here unless you pass useNA = "ifany". The summary() function provides an overview of a numerical column, showing the minimum, quartiles, mean, maximum, and the number of NA values, making it easier to detect out-of-range values or anomalies.
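As a hedged illustration of turning these summaries into explicit checks, the sketch below counts NA as its own category, lists names that occur more than once, and flags scores outside a valid range. The 0 to 100 range is an assumption made for this example; the lesson does not state what range scores should fall in.

```r
# Same sample data frame as above, repeated so this sketch runs on its own
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Count NA as its own category so missing names show up in the table
name_counts <- table(df$name, useNA = "ifany")
print(name_counts)

# Names that appear more than once
repeated_names <- names(name_counts[name_counts > 1])
print(repeated_names)

# Flag non-missing scores outside the assumed valid range of 0 to 100
# (the range is a hypothetical rule chosen for this example)
out_of_range <- !is.na(df$score) & (df$score < 0 | df$score > 100)
print(df[out_of_range, ])
```

In this sample no score falls outside the assumed range, so the last print shows an empty data frame. With real data, the same mask would isolate the rows to investigate.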

Section 3. Chapter 1

Detecting Missing, Duplicated, and Inconsistent Data

1. What function in R is commonly used to detect missing values in a data frame?

2. Which function helps you find duplicated rows in a dataset?

3. Why is it important to identify inconsistent data before analysis?