Reshaping Data with Pivot Functions | Feature Engineering and Data Transformation


As you prepare data for analysis or modeling, you often encounter datasets that are not structured in the most useful way. Sometimes, data is in a wide format, where each variable gets its own column, but your analysis or model expects a long format, where observations are stacked in rows. Other times, you may need to go from long to wide for reporting or visualization. Pivoting data allows you to reshape datasets between these formats, making them easier to work with for different analytical tasks. This flexibility is essential when preparing features for machine learning, aggregating results, or visualizing trends over time.


library(tidyr)
library(dplyr)

# Sample data in wide format
scores <- data.frame(
  student = c("Alice", "Bob", "Carol"),
  math = c(90, 85, 88),
  english = c(95, 80, 92)
)
print(scores)

# Pivot from wide to long format
scores_long <- pivot_longer(
  scores,
  cols = c(math, english),
  names_to = "subject",
  values_to = "score"
)
print(as.data.frame(scores_long))

# Pivot back from long to wide format
scores_wide <- pivot_wider(
  scores_long,
  id_cols = student,
  names_from = subject,
  values_from = score
)
print(as.data.frame(scores_wide))

When you use pivot_longer(), the cols argument specifies which columns to reshape into longer format. The names_to argument tells R what to call the new column that will contain the names of the original columns (like "subject" in the example). The values_to argument sets the name for the new column that will store the values from those columns (like "score"). For pivot_wider(), the id_cols argument identifies the columns that should remain as identifiers (such as "student"), while names_from and values_from decide which columns create new headers and which supply their values.
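As a minimal sketch of these arguments, the variant below rebuilds the same `scores` data and selects the columns to pivot by exclusion (`cols = -student`) rather than by name. It also omits `id_cols` on the way back, relying on the default behaviour of `pivot_wider()`. The variable names are the ones used in the example above; nothing else is assumed.

```r
library(tidyr)

scores <- data.frame(
  student = c("Alice", "Bob", "Carol"),
  math = c(90, 85, 88),
  english = c(95, 80, 92)
)

# Pivot every column except the identifier, instead of listing them by name
scores_long <- pivot_longer(
  scores,
  cols = -student,
  names_to = "subject",
  values_to = "score"
)

# One row per student-subject pair: 3 students x 2 subjects
print(nrow(scores_long))  # 6

# Without id_cols, pivot_wider() treats every column not used in
# names_from or values_from as an identifier (here: student)
scores_wide <- pivot_wider(
  scores_long,
  names_from = subject,
  values_from = score
)
print(names(scores_wide))
```

Selecting by exclusion is convenient when there are many measurement columns and only a few identifiers, since the call does not need updating when a new subject column is added.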

Note

Be careful when pivoting data. If the combination of identifier columns and the names_from column is not unique, pivot_wider() cannot place a single value in each cell: it emits a warning and returns list-columns, which is rarely what you want. Duplicate column names can also cause errors or ambiguous results during the pivot. Always check that your identifiers are unique before reshaping.
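One way to run that check is sketched below. The `scores_dup` data frame is hypothetical and built only for illustration: Alice has two math scores, so the student/subject pair is not unique. The sketch first counts rows per combination to find duplicates, then resolves them by averaging through the `values_fn` argument of `pivot_wider()`. Averaging is just one possible choice; whether it is appropriate depends on your data.

```r
library(tidyr)
library(dplyr)

# Hypothetical long-format data with a duplicated (student, subject) pair
scores_dup <- data.frame(
  student = c("Alice", "Alice", "Bob"),
  subject = c("math", "math", "math"),
  score = c(90, 70, 85)
)

# Count rows per identifier/name combination; n > 1 marks a duplicate
dupes <- scores_dup %>%
  count(student, subject) %>%
  filter(n > 1)
print(dupes)

# Resolve duplicates explicitly by aggregating them with values_fn
scores_wide <- pivot_wider(
  scores_dup,
  id_cols = student,
  names_from = subject,
  values_from = score,
  values_fn = mean
)
print(as.data.frame(scores_wide))
```

If `dupes` has zero rows, the identifiers are unique and the pivot can proceed without `values_fn`.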


Section 2. Chapter 2
