Aprenda Real-World Data Integration Scenarios

When working with real-world analytics projects, you often need to bring together data from multiple sources to gain a complete understanding of your business. For instance, a retail company might want to analyze sales performance by combining sales transactions, inventory details, and customer information. By integrating these data frames, you can answer questions such as which products are selling best, who is buying them, and whether you have enough stock to meet demand. This kind of data integration is fundamental for generating comprehensive reports and driving informed business decisions.


              12345678910111213141516171819202122232425262728
            
# Sample data frames
sales <- data.frame(
  sale_id = 1:4,
  product_id = c(101, 102, 103, 101),
  customer_id = c(1001, 1002, 1003, 1002),
  quantity = c(2, 1, 5, 3)
)

products <- data.frame(
  product_id = c(101, 102, 103),
  product_name = c("Widget", "Gadget", "Thingamajig"),
  inventory = c(20, 15, 0)
)

customers <- data.frame(
  customer_id = c(1001, 1002, 1003),
  customer_name = c("Alice", "Bob", "Charlie"),
  region = c("East", "West", "East")
)

library(dplyr)

# Joining sales with products, then with customers
sales_report <- sales %>%
  left_join(products, by = "product_id") %>%
  left_join(customers, by = "customer_id")

print(sales_report)

In this integration workflow, you first join the sales data frame with the products data frame using the product_id key. This step enriches each sales transaction with product details such as the name and inventory count. Next, you join the resulting data frame with the customers data frame using the customer_id key. This final join adds customer names and regions, creating a comprehensive view of each sale, including what was sold, to whom, and the available inventory. The sequence of joins ensures that all relevant information is consolidated for reporting and analysis.


              12345678910111213141516
            
library(tidyr)

# Simulate missing inventory data for a new product
products_missing <- data.frame(
  product_id = c(101, 102, 103, 104),
  product_name = c("Widget", "Gadget", "Thingamajig", "Doohickey"),
  inventory = c(20, 15, 0, NA)
)

# Join and handle missing inventory values
sales_report_missing <- sales %>%
  left_join(products_missing, by = "product_id") %>%
  left_join(customers, by = "customer_id") %>%
  mutate(inventory = replace_na(inventory, 0))

print(sales_report_missing)

When integrating data from different sources, missing values can appear—such as when a product in the sales data does not have a corresponding inventory record. Strategies for dealing with missing values include replacing them with default values (like 0 for inventory), flagging them for review, or excluding incomplete records from analysis. Choosing the right approach depends on your reporting requirements and the potential impact of missing data on your results.

Note

Document your join logic and any assumptions you make during data integration. This helps ensure your workflow is reproducible and understandable for others who may review or maintain your analysis in the future.

Tudo estava claro?

Obrigado pelo seu feedback!

Seção 4. Capítulo 3

Pergunte à IA

Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo

Deslize para mostrar o menu


              12345678910111213141516171819202122232425262728
            
# Sample data frames
sales <- data.frame(
  sale_id = 1:4,
  product_id = c(101, 102, 103, 101),
  customer_id = c(1001, 1002, 1003, 1002),
  quantity = c(2, 1, 5, 3)
)

products <- data.frame(
  product_id = c(101, 102, 103),
  product_name = c("Widget", "Gadget", "Thingamajig"),
  inventory = c(20, 15, 0)
)

customers <- data.frame(
  customer_id = c(1001, 1002, 1003),
  customer_name = c("Alice", "Bob", "Charlie"),
  region = c("East", "West", "East")
)

library(dplyr)

# Joining sales with products, then with customers
sales_report <- sales %>%
  left_join(products, by = "product_id") %>%
  left_join(customers, by = "customer_id")

print(sales_report)


              12345678910111213141516
            
library(tidyr)

# Simulate missing inventory data for a new product
products_missing <- data.frame(
  product_id = c(101, 102, 103, 104),
  product_name = c("Widget", "Gadget", "Thingamajig", "Doohickey"),
  inventory = c(20, 15, 0, NA)
)

# Join and handle missing inventory values
sales_report_missing <- sales %>%
  left_join(products_missing, by = "product_id") %>%
  left_join(customers, by = "customer_id") %>%
  mutate(inventory = replace_na(inventory, 0))

print(sales_report_missing)

Note

Tudo estava claro?

Obrigado pelo seu feedback!

Seção 4. Capítulo 3

Real-World Data Integration Scenarios

1. Why is data integration important in analytics?

2. What are some challenges when joining multiple data frames?

3. How can you handle missing values after joining data?

Real-World Data Integration Scenarios

1. Why is data integration important in analytics?

2. What are some challenges when joining multiple data frames?

3. How can you handle missing values after joining data?