Learn Calculating Key Mining Metrics | Foundations of Association Rules and Transactional Analysis
Market Basket Analysis and Recommendation Systems

Calculating Key Mining Metrics


Support: Definition, Calculation, and Interpretation

Definition

Support measures how frequently a specific item or itemset appears in a transaction dataset.

Mathematically, support for an itemset is the proportion of transactions that contain all items in that set. In a retail context, support helps you understand how common a combination of products is among all checkouts.

Support(itemset) = Number of transactions containing itemset / Total number of transactions

A higher support value means the itemset is more prevalent in your data, making it a candidate for further analysis or promotions. For instance, if "bread" and "butter" appear together in 30 out of 100 transactions, the support for {"bread", "butter"} is 30 / 100 = 0.3.
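As a minimal sketch of the formula above, support can be computed in plain Python by counting the transactions that contain every item of the itemset. The `support` helper and the synthetic 100-transaction dataset are assumptions made for illustration; they are not part of the lesson's data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical data: 30 baskets with bread and butter, 70 baskets without butter.
transactions = [["bread", "butter"]] * 30 + [["bread"]] * 20 + [["milk"]] * 50

support_bread_butter = support(transactions, {"bread", "butter"})
print(support_bread_butter)  # 0.3
```

The subset test `itemset <= set(t)` is what enforces "contains all items in that set": a basket with only bread does not count toward the support of {"bread", "butter"}.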

Confidence: Formula, Meaning, and Practical Use

Definition

Confidence assesses the likelihood that a customer who buys item A will also buy item B.

It is calculated as the ratio of the number of transactions containing both A and B to the number of transactions containing A (with or without B).

Confidence(A ⇒ B) = Number of transactions containing both A and B / Number of transactions containing A

In practice, confidence tells you how reliable the rule "if A, then B" is. If confidence is high, you can be more certain that customers who buy A will also buy B, which is useful for targeted recommendations or marketing.
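The confidence formula can be sketched in a few lines of plain Python. The `confidence` helper and the pasta/sauce baskets below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [set(t) for t in transactions if antecedent <= set(t)]
    if not with_a:
        raise ValueError("antecedent never occurs; confidence is undefined")
    with_both = sum(1 for t in with_a if consequent <= t)
    return with_both / len(with_a)


# Hypothetical baskets: A = "pasta", B = "sauce".
transactions = [
    ["pasta", "sauce"],
    ["pasta", "sauce", "cheese"],
    ["pasta"],
    ["pasta", "cheese"],
    ["sauce"],
]

conf_pasta_sauce = confidence(transactions, {"pasta"}, {"sauce"})  # 2 of 4 pasta baskets
conf_sauce_pasta = confidence(transactions, {"sauce"}, {"pasta"})  # 2 of 3 sauce baskets
print(conf_pasta_sauce)  # 0.5
print(round(conf_sauce_pasta, 4))
```

Note that confidence depends on direction: the two rules share the same numerator (baskets with both items) but divide by different antecedent counts, so Confidence(A ⇒ B) and Confidence(B ⇒ A) generally differ.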

Lift: Derivation, Measurement, and Importance

Definition

Lift evaluates how much more likely item B is to be purchased when item A is bought, compared to if purchases of A and B were independent events.

It is calculated by dividing the confidence of the rule by the support of the consequent (B):

Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

A lift value of 1 implies no association (independence), greater than 1 indicates a positive association, and less than 1 suggests a negative association. Lift is crucial because it adjusts for the popularity of the consequent, helping you distinguish truly meaningful relationships from coincidental ones.
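Substituting the confidence formula into the lift formula makes the comparison with independence explicit. In this sketch, Support(A, B) denotes the support of the itemset that contains both A and B.

```latex
\begin{aligned}
\mathrm{Lift}(A \Rightarrow B)
  &= \frac{\mathrm{Confidence}(A \Rightarrow B)}{\mathrm{Support}(B)} \\
  &= \frac{\mathrm{Support}(A, B) / \mathrm{Support}(A)}{\mathrm{Support}(B)} \\
  &= \frac{\mathrm{Support}(A, B)}{\mathrm{Support}(A)\,\mathrm{Support}(B)}
\end{aligned}
```

If A and B were independent, Support(A, B) would equal Support(A) × Support(B), and the ratio would be exactly 1. The final form is also symmetric in A and B, so Lift(A ⇒ B) = Lift(B ⇒ A), unlike confidence.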

Worked Example: Calculating Support, Confidence, and Lift

Suppose you have the following transaction data:

  • Transaction 1: Milk, Bread;
  • Transaction 2: Milk, Diaper, Beer, Bread;
  • Transaction 3: Milk, Diaper, Beer, Cola;
  • Transaction 4: Bread, Butter.

Let's calculate the metrics for the rule: Milk ⇒ Bread.

  • Support(Milk, Bread): Appears in Transactions 1 and 2 (2 out of 4) ⇒ 0.5;

  • Support(Milk): Appears in Transactions 1, 2, and 3 (3 out of 4) ⇒ 0.75;

  • Support(Bread): Appears in Transactions 1, 2, and 4 (3 out of 4) ⇒ 0.75.

  • Confidence(Milk ⇒ Bread): Support(Milk, Bread) / Support(Milk) = 0.5 / 0.75 = 0.6667;

  • Lift(Milk ⇒ Bread): Confidence(Milk ⇒ Bread) / Support(Bread) = 0.6667 / 0.75 = 0.8889.

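This means that although "Milk" and "Bread" appear together in half of the transactions, buying "Milk" does not increase the likelihood of buying "Bread". The lift is below 1: 66.67% of Milk buyers also buy Bread, compared with a 75% baseline across all transactions.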
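The arithmetic above can be double-checked with exact fractions, which avoids rounding when comparing against 2/3 and 8/9. This is a minimal sketch using only the standard library; the `support` helper is an illustrative name, and the transactions are the four listed in the example.

```python
from fractions import Fraction

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diaper", "Beer", "Bread"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Butter"},
]


def support(itemset):
    """Exact support of `itemset` as a fraction of all transactions."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


support_both = support({"Milk", "Bread"})       # 2/4 = 1/2
confidence = support_both / support({"Milk"})   # (1/2) / (3/4) = 2/3
lift = confidence / support({"Bread"})          # (2/3) / (3/4) = 8/9

print(support_both, confidence, lift)  # 1/2 2/3 8/9
```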

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder

    # Sample transaction data
    data = [
        ['Milk', 'Bread'],
        ['Milk', 'Diaper', 'Beer', 'Bread'],
        ['Milk', 'Diaper', 'Beer', 'Cola'],
        ['Bread', 'Butter']
    ]

    # Converting to DataFrame with one-hot encoding
    te = TransactionEncoder()
    te_ary = te.fit(data).transform(data)
    df = pd.DataFrame(te_ary, columns=te.columns_)

    # Calculating support
    support_milk = df['Milk'].mean()
    support_bread = df['Bread'].mean()
    support_milk_bread = (df['Milk'] & df['Bread']).mean()

    # Calculating confidence for rule: Milk => Bread
    confidence = support_milk_bread / support_milk

    # Calculating lift for rule: Milk => Bread
    lift = confidence / support_bread

    print(f"Support (Milk & Bread): {support_milk_bread:.2f}")
    print(f"Confidence (Milk ⇒ Bread): {confidence:.2f}")
    print(f"Lift (Milk ⇒ Bread): {lift:.2f}")

This code computes the key association rule metrics (support, confidence, and lift) on a small transaction dataset. Here's how each part works:

  1. Prepare Transaction Data:

    • The data list contains transactions, each as a list of purchased items;
    • Each sublist represents a single transaction (basket) from a customer.
  2. One-Hot Encode Transactions:

    • The code uses TransactionEncoder from the mlxtend library to convert the list of item lists into a format suitable for analysis;
    • fit learns all unique items, and transform creates a boolean array (True if the item is present in the transaction, False otherwise);
    • This array is converted into a pandas DataFrame, where each column is an item and each row is a transaction.
  3. Calculate Support:

    • support_milk = df['Milk'].mean() computes the proportion of transactions containing "Milk";
    • support_bread = df['Bread'].mean() computes the proportion containing "Bread";
    • support_milk_bread = (df['Milk'] & df['Bread']).mean() calculates the proportion of transactions containing both "Milk" and "Bread" (the intersection of the two columns).
  4. Compute Confidence:

    • confidence = support_milk_bread / support_milk calculates the confidence for the rule "Milk ⇒ Bread";
    • This measures how often "Bread" is purchased when "Milk" is purchased.
  5. Compute Lift:

    • lift = confidence / support_bread calculates the lift for the rule;
    • Lift compares the observed confidence to the expected confidence if "Milk" and "Bread" purchases were independent.
  6. Print and Interpret Results:

    • The code prints the support of the itemset {"Milk", "Bread"} and the confidence and lift of the rule "Milk ⇒ Bread";
    • A support of 0.50 means "Milk" and "Bread" are purchased together in half the transactions;
    • A confidence of 0.67 means that when "Milk" is bought, "Bread" is also bought two-thirds of the time;
    • A lift of 0.89 (below 1) suggests that buying "Milk" makes buying "Bread" slightly less likely than the baseline rate of 0.75 across all transactions.
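The same three steps extend from one rule to every single-item rule in the dataset. The sketch below uses only the standard library rather than pandas or mlxtend; the `support` helper and the `rules` dictionary are illustrative names, not part of any library.

```python
from itertools import permutations

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diaper", "Beer", "Bread"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Butter"},
]
n = len(transactions)


def support(*items):
    """Proportion of transactions containing all of `items`."""
    return sum(1 for t in transactions if set(items) <= t) / n


items = sorted(set().union(*transactions))
rules = {}
for a, b in permutations(items, 2):  # every ordered pair, i.e. every rule a => b
    joint = support(a, b)
    if joint == 0:
        continue  # skip pairs that never occur together
    conf = joint / support(a)
    rules[(a, b)] = {
        "support": joint,
        "confidence": conf,
        "lift": conf / support(b),
    }

# Print rules from highest to lowest lift.
for (a, b), m in sorted(rules.items(), key=lambda kv: -kv[1]["lift"]):
    print(f"{a} => {b}: support={m['support']:.2f}, "
          f"confidence={m['confidence']:.2f}, lift={m['lift']:.2f}")
```

On this data, the rule Diaper ⇒ Beer has a confidence of 1.0 and a lift of 2.0, because both items appear in exactly the same two transactions, while Milk ⇒ Bread stays at a lift of about 0.89.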

1. Suppose in a dataset of 1,000 transactions, 150 transactions contain both "apples" and "bananas". What is the support for the itemset {"apples", "bananas"}?

2. Which statement best distinguishes confidence from lift in association rule mining?
