Learn Constructing Frequent Pattern Trees with the FP-Growth Algorithm | High Performance Rule Mining and Scale Optimization

Swipe to show menu

FP-Growth Overview

FP-Growth is a high-performance algorithm designed for mining frequent itemsets in large transactional datasets. Unlike Apriori, which generates candidate itemsets and repeatedly scans the database, FP-Growth builds a compact data structure called the FP-Tree (Frequent Pattern Tree) that allows frequent patterns to be mined without candidate generation. This results in significant speed and memory improvements, especially for datasets with many frequent patterns.

Tree Structure

The FP-Tree structure compresses the entire set of transactions into a prefix-tree format. Each transaction is processed by sorting its items in descending order of frequency and then inserting them into the tree, sharing common prefixes with existing paths. The root node is a null node that serves as an entry point, and each node in the tree represents an item along with a count of how many transactions share that prefix up to that point.

Node Linking

To facilitate efficient mining, all nodes representing the same item are connected via a linked list, forming what is called a header table. This allows you to quickly traverse all occurrences of a particular item in the tree. Each entry in the header table points to the first node in the tree for that item, and each node contains a link to the next node with the same item name.

Mining Patterns

Mining frequent patterns from the FP-Tree involves two main steps. First, you traverse the tree from the bottom up, following the links for each item in the header table to collect all prefix paths leading to that item. Then, you construct conditional FP-Trees for each item and recursively mine them to extract all frequent itemsets. This recursive process continues until no more frequent patterns can be found.

Example: Step-by-step FP-Tree Construction

Let us walk through the exact step-by-step construction using your example transactions, assuming a Minimum Support Threshold of 2 (meaning items must appear at least twice to be considered).

Suppose you have the following transactions:

- Transaction 1: A, B, D;
- Transaction 2: B, C, E;
- Transaction 3: A, B, C, E;
- Transaction 4: B, E.

Step 1: Scan the Database and Sort by Frequency

First, the algorithm reads the entire dataset to count how many times each individual item appears. Any item that falls below the minimum support threshold is discarded. Then, a global sorting order is established from highest to lowest frequency.

Global Item Count:

B appears 4 times

E appears 3 times

A appears 2 times

C appears 2 times

D appears 1 time (Discarded because it is less than the Minimum Support of 2)

The Final Ordering Rule: Every transaction will now be filtered and rearranged to strictly follow this descending frequency order:

$\mathbf{B} \longrightarrow \mathbf{E} \longrightarrow \mathbf{A} \longrightarrow \mathbf{C}$

Step 2: Reorder Each Transaction

Before inserting items into the tree, the algorithm rewrites each transaction. It strips out the infrequent item (D) and rearranges the remaining items to match the new global ordering rule.

- Transaction 1: A, B, D becomes B, A
- Transaction 2: B, C, E becomes B, E, C
- Transaction 3: A, B, C, E becomes B, E, A, C
- Transaction 4: B, E becomes B, E

Step 3: Insert Paths into the FP-Tree

Every FP-Tree begins with an empty placeholder called the Null Root Node. We now insert our newly reordered transactions into the tree one by one, incrementing counters when paths overlap and branching out when they diverge.

Insert Transaction 1 {B, A}: the tree creates a brand new branch straight down from the root.

Insert Transaction 2 {B, E, C}: the algorithm starts at the root and checks the first item, B. Since B already exists as a child of the root, the path overlaps! The algorithm moves to B and increments its count to 2. The next item is E. Since B does not have a child named E yet, the tree splits here and creates a new sub-branch for E, followed by C.

Insert Transaction 3 {B, E, A, C}: starting from the root, the path hits B (count increments to 3), then moves down to the existing E node (count increments to 2). The next item is A. Since E doesn't have a child named A, the tree creates a new split under E for (A:1) -> (C:1).

Insert Transaction 4 {B, E}: the algorithm flows down the most popular prefix path and B increments to 4, E increments to 3. The transaction ends here, so no new nodes are added.

(Null Root) | (B: 4) / \ (A: 1) (E: 3) / \ (C: 1) (A: 1) | (C: 1)


              12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364
            
transactions = [
    ['A', 'B', 'D'],
    ['B', 'C', 'E'],
    ['A', 'B', 'C', 'E'],
    ['B', 'E']
]

# Step 1: Counting item frequency
from collections import defaultdict, Counter

item_counts = Counter()
for transaction in transactions:
    item_counts.update(transaction)

# Step 2: Removing infrequent items
min_support = 2
freq_items = {item for item, count in item_counts.items() if count >= min_support}

# Step 3: Sorting items in each transaction by frequency
def sort_transaction(transaction):
    return sorted([item for item in transaction if item in freq_items],
                  key=lambda item: (-item_counts[item], item))

sorted_transactions = [sort_transaction(t) for t in transactions]

# Step 4: Defining FP-Tree node class
class FPTreeNode:
    def __init__(self, item, count, parent):
        self.item = item
        self.count = count
        self.parent = parent
        self.children = {}
        self.link = None  # Node link for header table

# Step 5: Building the FP-Tree and header table
def build_fp_tree(transactions):
    root = FPTreeNode(None, 1, None)
    header_table = defaultdict(list)

    for transaction in transactions:
        current = root
        for item in transaction:
            if item not in current.children:
                current.children[item] = FPTreeNode(item, 1, current)
                header_table[item].append(current.children[item])
            else:
                current.children[item].count += 1
            current = current.children[item]
    # Linking nodes in header table
    for item, nodes in header_table.items():
        for i in range(len(nodes) - 1):
            nodes[i].link = nodes[i + 1]
    return root, header_table

fp_root, header_table = build_fp_tree(sorted_transactions)

# Step 6: Printing the FP-Tree
def print_tree(node, indent=0):
    if node.item is not None:
        print('  ' * indent + f"{node.item}:{node.count}")
    for child in node.children.values():
        print_tree(child, indent + 1)

print_tree(fp_root)

This code sample shows how to construct an FP-Tree from a list of transactions using Python. Here is a detailed explanation of each step:

Counting item frequencies:
- The code uses the Counter class from the collections module to count how many times each item appears across all transactions;
- This helps identify which items are frequent enough to be included in the tree.
Removing infrequent items:
- Items that do not meet the minimum support threshold are filtered out;
- The set freq_items contains only items that appear at least as many times as specified by min_support.
Sorting transactions by frequency:
- Each transaction is filtered to keep only frequent items;
- The items are then sorted in descending order of their global frequency (and alphabetically to break ties);
- Sorting maximizes prefix sharing in the FP-Tree, making the tree more compact.
Defining the FP-Tree node class:
- The FPTreeNode class represents each node in the tree;
- Each node stores the item name, count, a pointer to its parent, a dictionary of its children, and a link to the next node with the same item name (for header table traversal).
Building the FP-Tree and header table:
- The build_fp_tree function creates the tree by inserting each sorted transaction;
- Starting from the root, for each item in a transaction, the function checks if a child node exists—if not, it creates one and adds it to the header table;
- If a node already exists, its count is incremented;
- After building the tree, the code links all nodes with the same item name using the link attribute, enabling fast traversal via the header table.
Printing the FP-Tree structure:
- The print_tree function displays the tree by recursively printing each node and its count with indentation to show the tree structure;
- This output helps visualize how transactions are compressed into shared paths within the FP-Tree.

1. Which of the following is a key advantage of FP-Growth over the Apriori algorithm?

2. What is the structure of an FP-Tree node?

Everything was clear?

Thanks for your feedback!

Section 2. Chapter 1

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Section 2. Chapter 1