Cleaning Pipelines
Data cleaning pipelines are essential tools for anyone working with data in R. They provide a structured approach to transforming raw, messy datasets into clean, analysis-ready data. By automating a sequence of cleaning steps, pipelines not only save time but also make the process transparent and repeatable. This means that you can easily document what was done, reproduce your results, and share your workflow with others. Pipelines are especially valuable when you need to apply the same cleaning logic to multiple datasets or when collaborating in a team setting.
library(dplyr)
library(knitr)

# Sample data frame with issues
df <- data.frame(
  id = 1:6,
  age = c(25, NA, 30, 45, 200, 28),
  gender = c("male", "female", "femle", NA, "male", "female"),
  income = c(50000, 60000, 70000, 80000, 90000, NA)
)

# Cleaning pipeline
cleaned_df <- df %>%
  filter(!is.na(age), age < 120) %>%  # Remove missing ages and filter outliers
  mutate(gender = ifelse(gender == "femle", "female", gender)) %>%  # Correct gender inconsistency
  filter(!is.na(gender)) %>%  # Remove rows with missing gender
  mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))  # Impute missing income

kable(cleaned_df)
In this cleaning pipeline, you first remove rows with missing ages and unrealistic age values using filter(!is.na(age), age < 120). Next, you correct a common typo in the gender column by replacing "femle" with "female". Rows with missing gender are then filtered out. Finally, you handle missing income values by imputing them with the median income. The %>% pipe operator (from the magrittr package, re-exported by dplyr) allows you to chain these steps in a clear, readable sequence. This approach makes your workflow easy to follow and ensures that each transformation is applied in the correct order, improving both readability and reproducibility.
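Note that since R 4.1, base R also provides a native pipe operator, |>, which works for this pipeline without changing any of the dplyr verbs. A minimal sketch of the same pipeline using the native pipe:

```r
library(dplyr)

df <- data.frame(
  id = 1:6,
  age = c(25, NA, 30, 45, 200, 28),
  gender = c("male", "female", "femle", NA, "male", "female"),
  income = c(50000, 60000, 70000, 80000, 90000, NA)
)

# Same cleaning steps, chained with the base-R native pipe (R >= 4.1)
cleaned_native <- df |>
  filter(!is.na(age), age < 120) |>
  mutate(gender = ifelse(gender == "femle", "female", gender)) |>
  filter(!is.na(gender)) |>
  mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))
```

For simple left-to-right chains like this one, the two pipes behave the same; %>% remains the conventional choice in dplyr-centric code.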
# Custom cleaning function
clean_data <- function(data) {
  data %>%
    filter(!is.na(age), age < 120) %>%
    mutate(gender = ifelse(gender == "femle", "female", gender)) %>%
    filter(!is.na(gender)) %>%
    mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))
}

# Apply the function to a data frame
cleaned_df2 <- clean_data(df)
kable(cleaned_df2)
By creating a custom cleaning function like clean_data, you make your cleaning pipeline modular and reusable. This means you can apply the same set of cleaning steps to any new dataset that shares a similar structure, reducing code duplication and potential errors. Modular functions also make it easier to maintain and update your cleaning logic as requirements change.
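One way to make such a function adaptable, as requirements change, is to expose its hard-coded choices as arguments. The sketch below adds a hypothetical max_age parameter (the name is ours, not from the lesson) and shows how lapply can apply the same logic to several data frames at once:

```r
library(dplyr)

# Parameterized variant of clean_data; max_age is an illustrative argument
clean_data_flex <- function(data, max_age = 120) {
  data %>%
    filter(!is.na(age), age < max_age) %>%
    mutate(gender = ifelse(gender == "femle", "female", gender)) %>%
    filter(!is.na(gender)) %>%
    mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income))
}

df <- data.frame(
  id = 1:6,
  age = c(25, NA, 30, 45, 200, 28),
  gender = c("male", "female", "femle", NA, "male", "female"),
  income = c(50000, 60000, 70000, 80000, 90000, NA)
)

# Apply the same cleaning logic to a list of datasets in one call
datasets <- list(survey_a = df, survey_b = df)
cleaned_all <- lapply(datasets, clean_data_flex, max_age = 100)
```

Because the threshold is now an argument with a sensible default, callers can tighten or relax it per dataset without editing the function body.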
1. What is the main advantage of using pipelines for data cleaning in R?
2. Which operator is used in R to chain together multiple data manipulation steps?
3. Why is it beneficial to write custom cleaning functions?