Detecting Missing, Duplicated, and Inconsistent Data
Ensuring high data quality is fundamental to any data analysis project. Poor-quality data, such as missing, duplicated, or inconsistent values, can mislead your analysis, produce incorrect results, and erode trust in your findings. Real-world datasets often contain these issues, and understanding how to detect and address them is the first step toward reliable insights.
```r
# Sample data frame with missing and duplicated values
df <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 6, NA),
  name = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100)
)

# Detect missing values
missing_mask <- is.na(df)
print("Missing value mask:")
print(missing_mask)

# Find duplicated rows
duplicate_rows <- duplicated(df)
print("Duplicated rows (TRUE if duplicated):")
print(duplicate_rows)

# Extract unique rows
unique_rows <- unique(df)
print("Unique rows in the data frame:")
print(unique_rows)
```
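In this code, is.na() checks each entry for missing values and returns a logical matrix with the same dimensions as the data frame. The duplicated() function returns a logical vector with one element per row. It marks the second and later occurrences of a repeated row as TRUE, while the first occurrence stays FALSE, so in the sample data only row 5 is flagged. To get a version of your data frame with only unique rows, use the unique() function, which keeps the first occurrence of each row.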
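The logical results above are often easier to act on once they are summarized. The following is a minimal sketch, assuming the same sample data frame as in the lesson; the variable names na_per_column, incomplete_rows, and all_copies are illustrative and not part of the original material.

```r
# Same sample data as in the lesson, repeated so this block runs on its own
df <- data.frame(
  id    = c(1, 2, 3, 4, 4, 5, 6, NA),
  name  = c("Alice", "Bob", "Carol", "Dave", "Dave", NA, "Eve", "Frank"),
  score = c(95, 88, 92, NA, NA, 88, 95, 100),
  stringsAsFactors = FALSE
)

# Count missing values per column by summing the logical mask
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Keep only the rows that contain at least one missing value
incomplete_rows <- df[!complete.cases(df), ]
print(incomplete_rows)

# Flag every copy of a repeated row, including the first occurrence,
# by combining a forward scan with a backward scan
all_copies <- duplicated(df) | duplicated(df, fromLast = TRUE)
print(df[all_copies, ])
```

Combining duplicated(df) with duplicated(df, fromLast = TRUE) is one way to review all copies of a repeated row side by side before deciding which one to keep.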
```r
# Detecting inconsistencies in categorical and numerical data

# View frequency of each name (categorical)
name_table <- table(df$name)
print("Name frequency table:")
print(name_table)

# Summarize numerical column 'score'
score_summary <- summary(df$score)
print("Score summary statistics:")
print(score_summary)
```
The table() function counts how often each unique value occurs in a categorical column, which can help you spot unexpected categories or typos. By default it leaves out NA values; pass useNA = "ifany" to count them as well. The summary() function gives an overview of a numerical column: the minimum, quartiles, median, mean, and maximum, plus the number of NA values when any are present. This makes it easier to detect out-of-range values or anomalies.
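The sample data in this lesson happens to contain no typos or out-of-range scores, so the sketch below uses a small hypothetical data frame, survey, with deliberate inconsistencies. The data, the variable names, and the valid score range of 0 to 100 are all assumptions made for illustration.

```r
# Hypothetical data with deliberate inconsistencies (not from the lesson)
survey <- data.frame(
  name  = c("Alice", "alice", "Bob ", "Bob", NA),
  score = c(95, 105, 88, -3, NA),
  stringsAsFactors = FALSE
)

# Count each raw value and include NA as its own category
raw_counts <- table(survey$name, useNA = "ifany")
print(raw_counts)

# Normalize case and surrounding whitespace, then count again;
# fewer categories than before suggests spelling variants of one value
clean_name <- tolower(trimws(survey$name))
clean_counts <- table(clean_name, useNA = "ifany")
print(clean_counts)

# Flag scores outside the assumed valid range of 0 to 100;
# which() skips NA, so missing scores are not reported here
out_of_range <- which(survey$score < 0 | survey$score > 100)
print(survey[out_of_range, ])
```

Comparing the raw and normalized frequency tables is a quick check for values that differ only in capitalization or stray spaces. The range check depends on knowing the valid limits for the column in advance.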
1. What function in R is commonly used to detect missing values in a data frame?
2. Which function helps you find duplicated rows in a dataset?
3. Why is it important to identify inconsistent data before analysis?