Learn Tidying Data with tidyr | Data Cleaning and Wrangling Essentials


Definition

Tidy data is a way of organizing datasets so that each variable forms a column, each observation forms a row, and each type of observational unit forms a table. The principles behind tidy data ensure that your data is structured in a consistent and predictable way, making it easier to manipulate, analyze, and visualize.

When working with real-world data, you often encounter datasets that are not structured for easy analysis. The tidyr package in R provides functions for reshaping and tidying such data. Two of the most important are pivot_longer() and pivot_wider(), which convert between wide format (each measurement in its own column) and long format (each measurement in its own row). Many analysis and visualization tasks require one format or the other.


# Simulate a wide dataset
library(tidyr)
library(dplyr)

wide_data <- tibble(
  id = 1:3,
  math_2022 = c(85, 90, 78),
  math_2023 = c(88, 92, 80),
  science_2022 = c(80, 85, 75),
  science_2023 = c(83, 89, 78)
)

# Convert wide data to long format using pivot_longer()
long_data <- wide_data %>%
  pivot_longer(
    cols = -id,
    names_to = c("subject", "year"),
    names_sep = "_",
    values_to = "score"
  )

print(long_data)
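
Because names_to builds its values from the column names, the year column in long_data comes out as character strings ("2022", "2023"). If you need it as a number, one option is the names_transform argument of pivot_longer(). The sketch below repeats the wide_data example and assumes tidyr 1.1.0 or later; the name long_typed is only illustrative.

```r
library(tidyr)
library(dplyr)

# Same wide dataset as in the example above
wide_data <- tibble(
  id = 1:3,
  math_2022 = c(85, 90, 78),
  math_2023 = c(88, 92, 80),
  science_2022 = c(80, 85, 75),
  science_2023 = c(83, 89, 78)
)

# names_transform applies a function to a column created from the names;
# here it converts the "year" pieces from character to integer
long_typed <- wide_data %>%
  pivot_longer(
    cols = -id,
    names_to = c("subject", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer),
    values_to = "score"
  )

# Inspect the type of the new year column
str(long_typed$year)
```

With year stored as an integer, you can filter or sort on it numerically without an extra conversion step.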

You use pivot_wider() when you want to spread long-format data into a wide format, often to make comparisons across columns or to prepare data for specific types of analysis. This is useful when each row in your dataset represents a single observation, but you want to see each variable or measurement as its own column.

Key situations to use pivot_wider():

Transforming long data so each measurement type becomes a separate column;
Making it easier to compare values across categories or time periods;
Preparing data for functions or visualizations that require wide format.

Example: If you have a dataset where each row records a student's score in a subject and year, you can use pivot_wider() to create separate columns for each subject and year combination, making side-by-side comparison straightforward.
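
The student-score example just described can be sketched as follows. The dataset long_scores and its values are made up for illustration (they mirror the wide_data example earlier in this chapter), and wide_scores is a hypothetical name.

```r
library(tidyr)
library(dplyr)

# Long-format data: one row per student, subject, and year
long_scores <- tibble(
  id      = rep(1:3, each = 4),
  subject = rep(c("math", "math", "science", "science"), times = 3),
  year    = rep(c("2022", "2023"), times = 6),
  score   = c(85, 88, 80, 83,
              90, 92, 85, 89,
              78, 80, 75, 78)
)

# Spread subject and year into column names; by default the pieces
# are joined with "_" (for example, math_2022)
wide_scores <- long_scores %>%
  pivot_wider(
    names_from  = c(subject, year),
    values_from = score
  )

print(wide_scores)
```

Each student now occupies a single row, so scores for different subjects and years sit side by side.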


# separate() splits one column into several; unite() combines several columns into one
# Simulate a dataset with combined columns
library(tidyr)
library(dplyr)

data <- tibble(
  id = 1:3,
  info = c("A_2022", "B_2023", "C_2022"),
  score = c(90, 85, 88)
)

# Separate the 'info' column into 'group' and 'year'
separated_data <- data %>%
  separate(info, into = c("group", "year"), sep = "_")

# Unite 'group' and 'year' back into a single column
united_data <- separated_data %>%
  unite("info", group, year, sep = "_")

print(separated_data)
print(united_data)
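
Note that separate() returns the new columns as character by default, so year holds "2022" rather than 2022. A minimal sketch of one way around this is the convert argument of separate(), shown below on the same data; typed_data is an illustrative name. In tidyr 1.3.0 and later, separate() is marked as superseded in favor of separate_wider_delim(), but it still works and is used here to match the example above.

```r
library(tidyr)
library(dplyr)

# Same dataset as in the example above
data <- tibble(
  id = 1:3,
  info = c("A_2022", "B_2023", "C_2022"),
  score = c(90, 85, 88)
)

# convert = TRUE asks separate() to guess a suitable type for each new column,
# so "year" becomes integer while "group" stays character
typed_data <- data %>%
  separate(info, into = c("group", "year"), sep = "_", convert = TRUE)

# Inspect the resulting column types
str(typed_data)
```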

Tidying data is a crucial step for preparing your datasets for analysis or visualization. Clean and well-structured data makes it easier to use statistical functions, create plots, and share results with others. Functions like pivot_longer(), pivot_wider(), separate(), and unite() from the tidyr package give you the flexibility to reshape your data as needed for your analysis goals.

Key use cases for tidying data:

Prepare raw datasets for statistical analysis;
Structure data for creating clear, effective visualizations;
Ensure compatibility with R functions and modeling tools;
Make it easy to share and reproduce your data workflows;
Simplify the process of identifying and correcting data issues.
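
As a small illustration of the first use case, the sketch below reshapes the wide_data example into long format and then computes the mean score per subject and year with dplyr. The name mean_scores is hypothetical, and the pipeline is one possible approach rather than a required pattern.

```r
library(tidyr)
library(dplyr)

# Same wide dataset as in the first example of this chapter
wide_data <- tibble(
  id = 1:3,
  math_2022 = c(85, 90, 78),
  math_2023 = c(88, 92, 80),
  science_2022 = c(80, 85, 75),
  science_2023 = c(83, 89, 78)
)

# Once the data is long, subject and year are ordinary columns,
# so a grouped summary needs no column-by-column code
mean_scores <- wide_data %>%
  pivot_longer(
    cols = -id,
    names_to = c("subject", "year"),
    names_sep = "_",
    values_to = "score"
  ) %>%
  group_by(subject, year) %>%
  summarise(mean_score = mean(score), .groups = "drop")

print(mean_scores)
```

In the wide layout, the same summary would require naming each of the four score columns explicitly.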

1. What is the main goal of tidying data?

2. Which tidyr function would you use to convert wide data to long format?

3. When might you use `separate()` in tidyr?


Section 1. Chapter 9
