Summary  
This chapter shows how to group data by a categorical variable and compute summary statistics (e.g., mean) for each group using a pipeline of group_by and summarize operations.  

General domain of usage  
Exploratory data analysis

When analyzing data, you often need to understand how different subgroups behave. For example, you might want to know the average sales for each region, or the mean height by gender. Calculating summary statistics by group (such as the **mean**, **median**, or **count**) helps you uncover patterns and differences that would be hidden in overall averages. This is a key step in exploratory data analysis, reporting, and feature engineering, because it gives you insights specific to each category in your data.

```r
# Load the dplyr package
library(dplyr)

# Example data frame
df <- data.frame(
  group = c("A", "A", "B", "B", "B", "C"),
  value = c(10, 12, 8, 9, 7, 15)
)

# Calculate the mean value for each group
summary_df <- df %>%
  group_by(group) %>%
  summarize(mean_value = mean(value, na.rm = TRUE))

print(as.data.frame(summary_df))
```

The code begins by loading the `dplyr` package and creating a simple data frame with a categorical `group` column and a numeric `value` column. When you use `group_by(group)`, you tell R to treat rows with the same group label as a unit for the next operations. The `summarize(mean_value = mean(value, na.rm = TRUE))` function then calculates the mean of `value` for each group, producing a new summary data frame with one row per group and the computed mean.
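
The same pattern extends to several statistics at once. The following is a minimal sketch that reuses the toy data frame from the example and computes the count, mean, and median per group in a single `summarize()` call; the column names `n_rows`, `mean_value`, and `median_value` are illustrative choices, not required names.

```r
library(dplyr)

# Same toy data as in the example above
df <- data.frame(
  group = c("A", "A", "B", "B", "B", "C"),
  value = c(10, 12, 8, 9, 7, 15)
)

# One summarize() call can return several statistics per group
stats_df <- df %>%
  group_by(group) %>%
  summarize(
    n_rows       = n(),                          # number of rows in the group
    mean_value   = mean(value, na.rm = TRUE),    # group mean
    median_value = median(value, na.rm = TRUE)   # group median
  )

print(as.data.frame(stats_df))
```

For this data, group B has three rows (8, 9, 7), so its mean and median are both 8.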

After summarizing, the result is a tibble in which each row represents one group and its mean. By default, `summarize()` drops the last grouping variable: with a single grouping variable, as here, the result is no longer grouped, but when you group by several variables the result stays grouped by all but the last. That remaining grouping affects later calculations, so check whether your data is still grouped, and call `ungroup()` (or pass `.groups = "drop"` to `summarize()`) before running further transformations.
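
To check whether a result is still grouped, you can inspect it directly. The sketch below assumes dplyr 1.0.0 or later (for the `.groups` argument) and uses a hypothetical `sales` data frame that is not part of the lesson's data. It shows that `summarize()` removes one level of grouping per call, so a result is ungrouped after summarizing by one variable but still grouped after summarizing by two.

```r
library(dplyr)

# Hypothetical data with two candidate grouping columns
sales <- data.frame(
  region  = c("North", "North", "South", "South"),
  product = c("x", "y", "x", "y"),
  amount  = c(100, 150, 80, 120)
)

# One grouping variable: summarize() drops it, so the result is ungrouped
by_region <- sales %>%
  group_by(region) %>%
  summarize(mean_amount = mean(amount))
is_grouped_df(by_region)      # FALSE

# Two grouping variables: only the last one (product) is dropped
by_both <- sales %>%
  group_by(region, product) %>%
  summarize(mean_amount = mean(amount), .groups = "drop_last")
group_vars(by_both)           # "region"

# Drop all grouping in the same step
fully_dropped <- sales %>%
  group_by(region, product) %>%
  summarize(mean_amount = mean(amount), .groups = "drop")
is_grouped_df(fully_dropped)  # FALSE
```

Calling `ungroup()` on `by_both` would have the same effect as `.groups = "drop"`; which one to use is a matter of style.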

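A common pitfall is forgetting to call `ungroup()` after summarizing by several variables, which can lead to unexpected results in subsequent operations. Also, be aware that if a group contains any missing values, summary functions like `mean()` return `NA` unless you specify `na.rm = TRUE`. If a group contains only missing values, `mean()` with `na.rm = TRUE` returns `NaN`, because there is nothing left to average.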
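
The effect of missing values is easiest to see side by side. This sketch uses a hypothetical data frame `df_na` (not part of the lesson's data) in which group A has one missing value and group B has only missing values; the `n_non_missing` column is an illustrative addition that helps you spot groups with no usable data.

```r
library(dplyr)

# Hypothetical data: group A has one NA, group B has only NAs
df_na <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(10, NA, NA, NA)
)

na_summary <- df_na %>%
  group_by(group) %>%
  summarize(
    mean_default  = mean(value),                # NA whenever the group has any NA
    mean_na_rm    = mean(value, na.rm = TRUE),  # NaN when nothing is left to average
    n_non_missing = sum(!is.na(value))          # how many values were actually used
  )

print(as.data.frame(na_summary))
```

Group A yields `NA` by default and 10 with `na.rm = TRUE`, while group B yields `NA` by default and `NaN` with `na.rm = TRUE`.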

