Calculating Key Mining Metrics

Support: Definition, Calculation, and Interpretation

Definition

Support measures how frequently a specific item or itemset appears in a transaction dataset.

Mathematically, support for an itemset is the proportion of transactions that contain all items in that set. In a retail context, support helps you understand how common a combination of products is among all checkouts.

Support(itemset) = Number of transactions containing itemset / Total number of transactions

A higher support value means the itemset is more prevalent in your data, making it a candidate for further analysis or promotions. For instance, if "bread" and "butter" appear together in 30 out of 100 transactions, the support for {"bread", "butter"} is 30 / 100 = 0.3.
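The definition above can be sketched in a few lines of plain Python. The `support` helper and the 100-basket list are illustrative assumptions, constructed only to match the bread-and-butter figures; they are not data from a real store.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical data: 30 baskets with bread and butter, 70 with bread only
transactions = [["bread", "butter"]] * 30 + [["bread"]] * 70

print(support(transactions, {"bread", "butter"}))  # 30 / 100 -> 0.3
```

The subset test `itemset <= set(t)` is what enforces "contains all items in that set": a basket with only bread does not count toward {"bread", "butter"}.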

Confidence: Formula, Meaning, and Practical Use

Definition

Confidence assesses the likelihood that a customer who buys item A will also buy item B.

It is calculated as the ratio of the number of transactions containing both A and B to the number of transactions containing A (whether or not they also contain B).

Confidence(A ⇒ B) = Number of transactions containing both A and B / Number of transactions containing A

In practice, confidence tells you how reliable the rule "if A, then B" is. If confidence is high, you can be more certain that customers who buy A will also buy B, which is useful for targeted recommendations or marketing.
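As a minimal sketch of the formula, the snippet below counts transactions directly. The helper names `count_containing` and `confidence` and the coffee-and-sugar baskets are hypothetical and chosen for illustration only.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def confidence(transactions, antecedent, consequent):
    """Confidence(A => B) = count(A and B) / count(A)."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_a = count_containing(transactions, antecedent)
    if n_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    n_ab = count_containing(transactions, antecedent | consequent)
    return n_ab / n_a

# Hypothetical baskets (illustrative only)
transactions = [
    ["coffee", "sugar"],
    ["coffee", "sugar"],
    ["coffee"],
    ["coffee"],
    ["sugar"],
]

# 2 of the 4 coffee baskets also contain sugar
print(confidence(transactions, {"coffee"}, {"sugar"}))  # 0.5
# 2 of the 3 sugar baskets also contain coffee
print(confidence(transactions, {"sugar"}, {"coffee"}))
```

The two calls give different values because the denominator changes with the antecedent, so the confidence of A ⇒ B is generally not the confidence of B ⇒ A.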

Lift: Derivation, Measurement, and Importance

Definition

Lift measures how much more likely item B is to be purchased when item A is bought than it would be if purchases of A and B were independent events.

It is calculated by dividing the confidence of the rule by the support of the consequent (B):

Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

A lift value of 1 implies no association (independence), a value greater than 1 indicates a positive association, and a value less than 1 suggests a negative association. Lift is crucial because it adjusts for the popularity of the consequent, helping you distinguish genuinely informative rules from rules that have high confidence only because B is bought often anyway.
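The sketch below computes lift with exact fractions and checks it against the equivalent form Support(A, B) / (Support(A) × Support(B)), which follows from substituting the confidence formula into the lift formula. The helper names and the tea-and-biscuits baskets are assumptions made up for illustration.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Exact fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))

def lift(transactions, antecedent, consequent):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    both = support(transactions, set(antecedent) | set(consequent))
    conf = both / support(transactions, antecedent)
    return conf / support(transactions, consequent)

# Hypothetical baskets: 4 with both items, 1 tea only, 1 biscuits only, 4 unrelated
transactions = (
    [["tea", "biscuits"]] * 4 + [["tea"]] + [["biscuits"]] + [["juice"]] * 4
)

rule_lift = lift(transactions, {"tea"}, {"biscuits"})

# Same quantity written as observed co-occurrence over the rate expected under independence
independence_form = support(transactions, {"tea", "biscuits"}) / (
    support(transactions, {"tea"}) * support(transactions, {"biscuits"})
)

print(rule_lift, float(rule_lift))     # 8/5 1.6
print(rule_lift == independence_form)  # True
```

Here tea and biscuits each appear in half of the baskets, so independence would predict them together in a quarter of baskets; they actually appear together in 40%, giving a lift above 1.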

Worked Example: Calculating Support, Confidence, and Lift

Suppose you have the following transaction data:

Transaction 1: Milk, Bread;
Transaction 2: Milk, Diaper, Beer, Bread;
Transaction 3: Milk, Diaper, Beer, Cola;
Transaction 4: Bread, Butter.

Let's calculate the metrics for the rule: Milk ⇒ Bread.

Support(Milk, Bread): Appears in Transactions 1 and 2 (2 out of 4) ⇒ 0.5;
Support(Milk): Appears in Transactions 1, 2, and 3 (3 out of 4) ⇒ 0.75;
Support(Bread): Appears in Transactions 1, 2, and 4 (3 out of 4) ⇒ 0.75;
Confidence(Milk ⇒ Bread): Support(Milk, Bread) / Support(Milk) = 0.5 / 0.75 = 0.6667;
Lift(Milk ⇒ Bread): Confidence(Milk ⇒ Bread) / Support(Bread) = 0.6667 / 0.75 = 0.8889.

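Because the lift is below 1, buying "Milk" does not increase the likelihood of buying "Bread": 66.67% of the transactions with "Milk" contain "Bread", compared with 75% of all transactions. The two items appear together often mainly because each is common on its own.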


import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction data
data = [
    ['Milk', 'Bread'],
    ['Milk', 'Diaper', 'Beer', 'Bread'],
    ['Milk', 'Diaper', 'Beer', 'Cola'],
    ['Bread', 'Butter']
]

# Converting to DataFrame with one-hot encoding
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Calculating support
support_milk = df['Milk'].mean()
support_bread = df['Bread'].mean()
support_milk_bread = (df['Milk'] & df['Bread']).mean()

# Calculating confidence for rule: Milk => Bread
confidence = support_milk_bread / support_milk

# Calculating lift for rule: Milk => Bread
lift = confidence / support_bread

print(f"Support (Milk & Bread): {support_milk_bread:.2f}")
print(f"Confidence (Milk ⇒ Bread): {confidence:.2f}")
print(f"Lift (Milk ⇒ Bread): {lift:.2f}")

This code computes the key association rule metrics (support, confidence, and lift) for a small transaction dataset. Here's how each part works:

Prepare Transaction Data:
- The data list contains transactions, each as a list of purchased items;
- Each sublist represents a single transaction (basket) from a customer.
One-Hot Encode Transactions:
- The code uses TransactionEncoder from the mlxtend library to convert the list of item lists into a format suitable for analysis;
- fit learns all unique items, and transform creates a boolean array (True if the item is present in the transaction, False otherwise);
- This array is converted into a pandas DataFrame, where each column is an item and each row is a transaction.
Calculate Support:
- support_milk = df['Milk'].mean() computes the proportion of transactions containing "Milk";
- support_bread = df['Bread'].mean() computes the proportion containing "Bread";
- support_milk_bread = (df['Milk'] & df['Bread']).mean() calculates the proportion of transactions containing both "Milk" and "Bread" (the intersection of the two columns).
Compute Confidence:
- confidence = support_milk_bread / support_milk calculates the confidence for the rule "Milk ⇒ Bread";
- This measures how often "Bread" is purchased when "Milk" is purchased.
Compute Lift:
- lift = confidence / support_bread calculates the lift for the rule;
- Lift compares the observed confidence to the expected confidence if "Milk" and "Bread" purchases were independent.
Print and Interpret Results:
- The code prints the support, confidence, and lift values for "Milk & Bread";
- A support of 0.50 means "Milk" and "Bread" are purchased together in half the transactions;
- A confidence of 0.67 means that when "Milk" is bought, "Bread" is also bought two-thirds of the time;
- A lift of 0.89 (below 1) suggests that buying "Milk" makes buying "Bread" slightly less likely than the overall rate at which "Bread" is purchased (0.67 versus 0.75).
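The same three calculations can be wrapped in one reusable function, so any rule in the dataset can be checked without repeating the steps. This is a plain-Python sketch rather than part of the lesson's mlxtend workflow; `rule_metrics` is a hypothetical helper name, and it assumes both sides of the rule occur at least once in the data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return support, confidence, and lift for the rule antecedent => consequent.

    Assumes the antecedent and the consequent each occur in at least one
    transaction; otherwise a division by zero is raised.
    """
    baskets = [set(t) for t in transactions]
    a, b = set(antecedent), set(consequent)
    n = len(baskets)
    support_a = sum(a <= t for t in baskets) / n
    support_b = sum(b <= t for t in baskets) / n
    support_ab = sum((a | b) <= t for t in baskets) / n
    confidence = support_ab / support_a
    return {
        "support": support_ab,
        "confidence": confidence,
        "lift": confidence / support_b,
    }

# The four transactions from the worked example
data = [
    ["Milk", "Bread"],
    ["Milk", "Diaper", "Beer", "Bread"],
    ["Milk", "Diaper", "Beer", "Cola"],
    ["Bread", "Butter"],
]

milk_bread = rule_metrics(data, {"Milk"}, {"Bread"})
diaper_beer = rule_metrics(data, {"Diaper"}, {"Beer"})

print({k: round(v, 4) for k, v in milk_bread.items()})
# {'support': 0.5, 'confidence': 0.6667, 'lift': 0.8889}
print(diaper_beer)
# {'support': 0.5, 'confidence': 1.0, 'lift': 2.0}
```

In this dataset, Diaper ⇒ Beer has the same support as Milk ⇒ Bread but a lift of 2.0, which illustrates how lift separates two rules that support alone cannot tell apart.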

1. Suppose in a dataset of 1,000 transactions, 150 transactions contain both "apples" and "bananas". What is the support for the itemset {"apples", "bananas"}?

2. Which statement best distinguishes confidence from lift in association rule mining?

Confidence measures how often items in B appear in transactions containing A, while lift compares this observed frequency to the expected occurrence rate if A and B were completely independent.

Lift measures the probability of items in B appearing given that item A has occurred, while confidence evaluates whether the frequency of the combined itemset is significantly higher than a random distribution.

Confidence scales the frequency of itemset B based on the overall popularity of item A, while lift ignores item popularity entirely and instead focuses purely on the absolute support of both itemsets.

Confidence and lift are always equal because they use identical transaction counts in their mathematical formulas, meaning any change in support affects both metrics in the exact same proportion.

Section 1. Chapter 2
