Working with Text, Dates, and Data Cleaning in R

Detecting Missing, Duplicated, and Inconsistent Data

Ensuring high data quality is fundamental to any data analysis project. Poor-quality data, such as missing, duplicated, or inconsistent values, can mislead your analysis, produce incorrect results, and erode trust in your findings. Real-world datasets often contain these issues, and learning to detect them is the first step toward reliable insights.

```r
# Sample data frame with missing and duplicated values
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Detect missing values
missing_mask <- is.na(df)
print("Missing value mask:")
print(missing_mask)

# Find duplicated rows
duplicate_rows <- duplicated(df)
print("Duplicated rows (TRUE if duplicated):")
print(duplicate_rows)

# Extract unique rows
unique_rows <- unique(df)
print("Unique rows in the data frame:")
print(unique_rows)
```

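In this code, is.na() checks each entry for missing values and returns a logical matrix with the same dimensions as the data frame, where TRUE marks a missing entry. The duplicated() function returns a logical vector with one element per row: the first occurrence of a row is FALSE, and each later identical copy is TRUE. Here, rows 4 and 5 are identical (id 4, "Dave", missing score), so only row 5 is flagged. To get a version of your data frame with only unique rows, use the unique() function, which keeps the first occurrence of each row.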
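The logical mask and the duplicate flags are often easier to read once they are summarized. The sketch below is one possible follow-up, not part of the original lesson code: it rebuilds the same sample data frame, counts missing values per column, pulls out the incomplete rows, and flags every copy of a duplicated row. The variable names na_per_column, incomplete_rows, and all_copies are illustrative.

```r
# Same sample data frame as in the lesson
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Count missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Keep only the rows that contain at least one missing value
incomplete_rows <- df[!complete.cases(df), ]
print(incomplete_rows)

# Flag every copy of a duplicated row, including the first occurrence,
# by combining a forward scan with a backward scan
all_copies <- duplicated(df) | duplicated(df, fromLast = TRUE)
print(df[all_copies, ])
```

Scanning from both directions is useful when you want to inspect all copies side by side before deciding which one to keep; duplicated() alone would hide the first occurrence.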

```r
# Detecting inconsistencies in categorical and numerical data

# View frequency of each name (categorical)
name_table <- table(df$name)
print("Name frequency table:")
print(name_table)

# Summarize numerical column 'score'
score_summary <- summary(df$score)
print("Score summary statistics:")
print(score_summary)
```

The table() function displays the frequency of each unique value in a categorical column, which can help you spot unexpected categories or typos. By default it leaves out NA values; pass useNA = "ifany" to count them as well. The summary() function provides an overview of a numerical column, showing the minimum, quartiles, median, mean, and maximum, along with the number of NA values when any are present, making it easier to detect out-of-range values or anomalies.
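The lesson's sample data happens to contain no typos or out-of-range scores, so the sketch below uses a hypothetical data frame, survey, to show what such inconsistencies can look like. The city labels, the scores, and the assumed valid range of 0 to 100 are all invented for illustration.

```r
# Hypothetical data with inconsistent labels and out-of-range scores
survey <- data.frame(
  city = c("Berlin", "berlin", "Berlin ", "Paris", "PARIS", NA),
  score = c(95, 88, 250, -5, 70, 100),
  stringsAsFactors = FALSE
)

# Raw frequency table, keeping NA as its own category
raw_counts <- table(survey$city, useNA = "ifany")
print(raw_counts)

# Normalize case and surrounding whitespace, then count again
city_clean <- tolower(trimws(survey$city))
clean_counts <- table(city_clean, useNA = "ifany")
print(clean_counts)

# Flag scores outside the assumed valid range of 0 to 100
out_of_range <- which(survey$score < 0 | survey$score > 100)
print(survey[out_of_range, ])
```

Comparing the raw and normalized tables shows how many apparent categories are really spelling variants of the same value. The range check depends on domain knowledge, so the bounds should come from whatever the column is supposed to measure.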

1. What function in R is commonly used to detect missing values in a data frame?

2. Which function helps you find duplicated rows in a dataset?

3. Why is it important to identify inconsistent data before analysis?

Section 3. Chapter 1

