Learn Grouping Data with group_by() | Grouping and Aggregation in R

Data Manipulation in R


Grouping data is a fundamental technique in analytics, especially in business contexts where you often need to analyze performance by categories such as region, product, or customer segment. By breaking down your data into meaningful groups, you can uncover insights that are hidden when looking only at overall totals. For example, grouping sales data by region enables you to compare how each area is performing, identify trends, and make targeted business decisions.


library(dplyr)

# Sample sales data frame
sales_data <- data.frame(
  region = c("North", "South", "East", "West", "North", "South"),
  sales = c(200, 150, 300, 250, 180, 210)
)

# Group sales data by region
sales_by_region <- sales_data %>%
  group_by(region)

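# Render the grouped data as a table. The rows look the same as in
# sales_data, because grouping alone does not change the values.
library(knitr)
kable(sales_by_region)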

The group_by() function from dplyr specifies how you want to segment your data for further analysis. The code above groups the sales data by the region column. This tells R to treat each unique region as a separate group. The rows and values themselves are unchanged, but the grouping sets the stage for calculations or summaries within each region.
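Because grouping does not visibly change the rows, it can help to inspect the grouping directly. The following is a minimal sketch that rebuilds the same sample data and uses dplyr's helper functions is_grouped_df(), group_vars(), n_groups(), and group_keys(); these helpers are not part of the original lesson and are shown here only as one way to check the result.

```r
library(dplyr)

# Same sample sales data as above
sales_data <- data.frame(
  region = c("North", "South", "East", "West", "North", "South"),
  sales = c(200, 150, 300, 250, 180, 210)
)

sales_by_region <- sales_data %>%
  group_by(region)

# Grouping keeps every row; only grouping metadata is added
nrow(sales_by_region) == nrow(sales_data)

is_grouped_df(sales_by_region)  # is the result a grouped data frame?
group_vars(sales_by_region)     # name(s) of the grouping column(s)
n_groups(sales_by_region)       # number of distinct groups
group_keys(sales_by_region)     # one row per group, showing its key value
```

With this sample data there are four groups, one for each distinct region.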


# Calculate total sales per region
total_sales_per_region <- sales_data %>%
  group_by(region) %>%
  summarise(total_sales = sum(sales))

kable(total_sales_per_region)

When you use group_by() together with summarise(), you can compute summary statistics for each group in a single step. In the example above, summarise() collapses each region's rows into one row holding that region's total sales, so the result has one row per region. This workflow lets you move from raw, detailed data to concise, actionable summaries that are essential for business reporting and decision-making.
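summarise() is not limited to one statistic. As an illustrative sketch on the same sample data, the call below computes several summaries per region at once; the column names total_sales, average_sale, and n_sales are chosen here for illustration only.

```r
library(dplyr)

# Same sample sales data as above
sales_data <- data.frame(
  region = c("North", "South", "East", "West", "North", "South"),
  sales = c(200, 150, 300, 250, 180, 210)
)

region_summary <- sales_data %>%
  group_by(region) %>%
  summarise(
    total_sales = sum(sales),    # sum of sales within each region
    average_sale = mean(sales),  # mean sale within each region
    n_sales = n()                # number of rows in each region
  )

region_summary
```

Each region appears exactly once in region_summary. For example, North combines its two rows (200 and 180) into a total of 380 and an average of 190.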

Definition

A grouped data frame is a special version of a data frame created by group_by(). Once data is grouped, many dplyr verbs (like summarise(), mutate(), or filter()) operate within each group rather than on the whole data set. This means calculations or transformations are performed separately for each group, making it easier to analyze data by segment.
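To make the per-group behavior of mutate() and filter() concrete, here is a small sketch on the same sample data. The column name share_of_region and the choice of "highest sale per region" are assumptions made for illustration. Calling ungroup() at the end is optional, but it removes the grouping so that later steps operate on the whole data set again.

```r
library(dplyr)

# Same sample sales data as above
sales_data <- data.frame(
  region = c("North", "South", "East", "West", "North", "South"),
  sales = c(200, 150, 300, 250, 180, 210)
)

# mutate() on grouped data: sum(sales) is computed per region,
# so each row gets its share of its own region's total
sales_with_share <- sales_data %>%
  group_by(region) %>%
  mutate(share_of_region = sales / sum(sales)) %>%
  ungroup()

# filter() on grouped data: max(sales) is computed per region,
# so the highest sale in each region is kept
top_sale_per_region <- sales_data %>%
  group_by(region) %>%
  filter(sales == max(sales)) %>%
  ungroup()

sales_with_share
top_sale_per_region
```

Without group_by(), the same mutate() call would divide by the overall total, and the same filter() call would keep only the single largest sale in the whole data set.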