Learn Processing Transactional Data Streams with the Apriori Algorithm | Foundations of Association Rules and Transactional Analysis

Swipe to show menu

The Apriori algorithm is a foundational tool for uncovering frequent itemsets in transactional data, which is essential for market basket analysis. At the heart of Apriori is the Apriori Principle, also known as the downward closure property.

Definition

Apriori Principle - If an itemset is frequent, then all of its subsets must also be frequent.

This insight dramatically reduces the number of itemsets that need to be evaluated, as any itemset containing an infrequent subset can be immediately discarded from consideration.

Candidate generation is the matchmaking phase of the algorithm. In each round, the algorithm looks at the winning frequent itemsets from the previous round and combines them to form new, larger groups called candidates. To do this efficiently, the algorithm only joins itemsets that are nearly identical, sharing all of their items except for the very last one.

How it works: If the algorithm discovers that {Bread, Milk} and {Bread, Diapers} are both highly popular on their own, it combines them to create a brand-new, three-item candidate for the next round: {Bread, Milk, Diapers}.

Immediately after generating new candidates, the algorithm initiates the Pruning phase to eliminate dead weight before running any heavy database calculations. This step relies entirely on the Apriori Principle, which states that a large itemset can only be frequent if all of its smaller individual pieces are also frequent. During pruning, the algorithm meticulously dissects each new candidate and looks at its smaller subsets. If even a single subset failed to make the cut in the previous round, the entire candidate is instantly deleted.

How it works: Imagine the candidate {Bread, Milk, Eggs} is generated. The algorithm checks its sub-components: {Bread, Milk}, {Bread, Eggs}, and {Milk, Eggs}. If your historical transaction logs show that {Milk, Eggs} is a rare combination that failed a previous round, the algorithm prunes {Bread, Milk, Eggs} entirely. This saves you from wasting computing power on a combination that is mathematically guaranteed to fail.

Once the candidate list has been pruned down to the most viable contenders, the algorithm moves to Frequent Itemset Identification. The system scans your raw transaction databases to count exactly how many times these remaining candidate groups appear together in real checkouts. Each candidate's total count is converted into a percentage. If this score meets or exceeds your pre-determined Minimum Support Threshold, the candidate is officially crowned a Frequent Itemset.

Definition

Minimum Support Threshold is a user-defined cutoff that determines the lowest frequency an itemset must have to be considered significant in the Apriori algorithm. By setting this threshold, you filter out itemsets that appear too infrequently in the dataset. Choosing a high threshold will result in fewer, more common itemsets, potentially missing interesting but rare combinations. Setting it too low may overwhelm you with too many itemsets, including those of little practical value. The threshold you select directly impacts the depth and focus of your analysis, so it should reflect the goals and scale of your data exploration.

This entire three-step loop repeats consecutively. It moves from single items to pairs, then to groups of three, and continues expanding until it runs a round where no new frequent itemsets can be found, successfully mapping your complete web of consumer purchasing behavior.

To make these ideas concrete, consider a simple Example using a small dataset. Suppose you have the following transactions:

Transaction 1: {Milk, Bread};
Transaction 2: {Milk, Diaper, Beer, Bread};
Transaction 3: {Milk, Diaper, Beer, Cola};
Transaction 4: {Diaper, Beer};
Transaction 5: {Milk, Diaper, Bread, Beer}.

The algorithm would first identify all single items that appear frequently, then generate candidate pairs from those, prune those that contain infrequent items, and continue this process to discover larger frequent groups, such as {Milk, Diaper, Beer}.


              1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162
            
import pandas as pd
from itertools import combinations

# Example transaction dataset
transactions = [
    ['Milk', 'Bread'],
    ['Milk', 'Diaper', 'Beer', 'Bread'],
    ['Milk', 'Diaper', 'Beer', 'Cola'],
    ['Diaper', 'Beer'],
    ['Milk', 'Diaper', 'Bread', 'Beer']
]

# Convert transactions to a DataFrame of one-hot encoded items
all_items = sorted(set(item for transaction in transactions for item in transaction))
df = pd.DataFrame([{item: (item in transaction) for item in all_items} for transaction in transactions])

def apriori(df, min_support=0.6):

    def get_support(itemset):
        return (df[list(itemset)].all(axis=1).sum()) / len(df)

    # Step 1: Finding frequent 1-itemsets
    frequent_itemsets = []
    current_level = []
    for item in all_items:
        support = get_support([item])
        if support >= min_support:
            current_level.append(frozenset([item]))
            frequent_itemsets.append((frozenset([item]), support))

    k = 2
    while current_level:
        # Step 2: Candidate generation
        candidates = []
        level_list = list(current_level)
        for i in range(len(level_list)):
            for j in range(i+1, len(level_list)):
                union = level_list[i] | level_list[j]
                if len(union) == k and union not in candidates:
                    # Step 3: Pruning
                    subsets = list(combinations(union, k-1))
                    if all(frozenset(subset) in current_level for subset in subsets):
                        candidates.append(union)

        # Step 4: Support counting
        next_level = []
        for candidate in candidates:
            support = get_support(candidate)
            if support >= min_support:
                next_level.append(candidate)
                frequent_itemsets.append((candidate, support))

        current_level = next_level
        k += 1

    # Format output
    return [(set(itemset), round(support, 2)) for itemset, support in frequent_itemsets]

# Running Apriori and displaying frequent itemsets
frequent_itemsets = apriori(df, min_support=0.6)
for itemset, support in frequent_itemsets:
    print(f"Frequent Itemset: {itemset}, Support: {support}")

This Python code demonstrates a simple implementation of the Apriori algorithm for identifying frequent itemsets in transactional data. Here's how each step is carried out:

Data Preparation

The transactions are represented as a list of lists, where each sublist contains items purchased together.
All unique items are identified and sorted.
A one-hot encoded DataFrame is created: each row represents a transaction, and each column corresponds to an item, with True or False indicating presence or absence.

Apriori Algorithm Steps

1. Frequent 1-Itemset Identification

The algorithm calculates the support for each individual item (the proportion of transactions containing the item).
If an item’s support meets or exceeds the min_support threshold (here, 0.6), it is considered frequent and included in the first list of frequent itemsets.

2. Candidate Generation

For itemsets of length k (starting from 2), the algorithm generates new candidate itemsets by combining frequent itemsets from the previous round that share all but one item.
Only unique combinations are considered, ensuring no duplicates.

3. Pruning

Each new candidate's subsets of length k-1 are checked.
If any subset is not in the list of frequent itemsets from the previous iteration, the candidate is pruned (excluded), based on the Apriori Principle.

4. Support Counting

For each remaining candidate, the algorithm counts how many transactions contain all items in the candidate.
The support is calculated and compared to the threshold. Only candidates meeting the threshold are kept as frequent itemsets for the next round.

5. Iteration

Steps 2–4 repeat, increasing the size of itemsets each time, until no new frequent itemsets are found.

Output

The final list includes all frequent itemsets and their support values.
The code prints each frequent itemset and its support, showing which combinations of items commonly appear together in the transactions.

1. Which of the following best describes the candidate generation step in the Apriori algorithm?

2. What is the main reason for pruning candidate itemsets during the Apriori algorithm?

Everything was clear?

Thanks for your feedback!

Section 1. Chapter 4

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Section 1. Chapter 4