Association Rule Mining
Apriori Algorithm
The Apriori algorithm is a classical data mining algorithm used to discover frequent itemsets in transactional datasets. It employs a level-wise, breadth-first search strategy to explore the space of itemsets efficiently. The algorithm is based on the Apriori principle introduced in the previous chapter (every subset of a frequent itemset must itself be frequent), which allows it to prune the search space and reduce computational complexity.
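As a minimal illustration of how the principle is used for pruning, the following Python sketch (the function and argument names are our own, not from the course) checks whether a candidate itemset contains any subset that is not frequent:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_itemsets):
    """Return True if some k-subset of the candidate is not frequent.

    By the Apriori principle, such a candidate cannot be frequent itself,
    so it can be pruned without ever counting its support.
    """
    k = len(candidate) - 1
    return any(
        frozenset(subset) not in frequent_k_itemsets
        for subset in combinations(candidate, k)
    )
```

For example, if {Milk, Beer} is not among the frequent 2-itemsets, any candidate 3-itemset containing both Milk and Beer is pruned before its support is counted.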
Algorithm description
1. Set the minimum support and confidence thresholds: the user specifies the minimum support and confidence required for itemsets and rules to be considered frequent and relevant, respectively;
2. Generate the candidate 1-itemsets: create a list of all unique items in the database; each item is a candidate 1-itemset;
3. Calculate support for 1-itemsets: compare each item's support against the minimum support threshold and retain only those items that meet or exceed it. These are the frequent 1-itemsets;
4. Generate candidate (k+1)-itemsets from frequent k-itemsets: join the frequent k-itemsets found in the previous step with themselves and, based on the Apriori principle, prune any candidate that has a subset not present in the list of frequent k-itemsets;
5. Calculate support for (k+1)-itemsets: compare each candidate's support against the minimum support threshold and retain those that meet or exceed it. These are the frequent (k+1)-itemsets;
6. Repeat steps 4 and 5: continue generating candidate itemsets and calculating their support, moving from k-itemsets to (k+1)-itemsets, until no more frequent itemsets can be found;
7. Generate association rules from frequent itemsets: for each frequent itemset, generate all possible rules by splitting the itemset into two non-empty subsets, one serving as the antecedent and the other as the consequent. Calculate the confidence of each rule and retain the rules that meet or exceed the minimum confidence threshold (a code sketch of the full procedure follows this list).
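The sketch below is one possible Python implementation of this level-wise procedure. The naming (`apriori`, `min_support`, `min_confidence`) is ours, not the course's reference code, and it favors clarity over efficiency: supports are recomputed by scanning all transactions at every level.

```python
from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    """Level-wise Apriori: transactions is a list of item sets,
    min_support and min_confidence are fractions in [0, 1].
    Returns (frequent itemsets with supports, association rules)."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Steps 2-3: candidate and frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]): support(frozenset([i])) for i in items}
    frequent = {s: sup for s, sup in frequent.items() if sup >= min_support}
    all_frequent = dict(frequent)

    # Steps 4-6: grow itemsets level by level until none survive.
    k = 1
    while frequent:
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset (Apriori principle).
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
        }
        # Step 5: keep candidates that meet the support threshold.
        frequent = {c: support(c) for c in candidates}
        frequent = {c: sup for c, sup in frequent.items() if sup >= min_support}
        all_frequent.update(frequent)
        k += 1

    # Step 7: derive rules antecedent -> consequent from each frequent itemset.
    rules = []
    for itemset, sup in all_frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / all_frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return all_frequent, rules
```

With the example transactions from the next section, a call such as `apriori(transactions, min_support=0.5, min_confidence=0.6)` would return the frequent itemsets discussed there (the confidence threshold here is our own choice, since the example fixes only the support threshold).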
Example
Suppose we have a grocery store transaction database with five transactions over the items Bread, Milk, Diaper, Beer, Eggs, and Cola, and a minimum support threshold of 50%.
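For concreteness, here is one possible set of five transactions, assumed for illustration, that is consistent with all of the support values used in the steps below:

```python
# Hypothetical transactions (not the original table); chosen so that every
# support value quoted in the example works out exactly.
transactions = [
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]
```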
Step 1: Calculate support for 1-itemsets;
- Support(Bread) = 4/5 = 80%
- Support(Milk) = 4/5 = 80%
- Support(Diaper) = 5/5 = 100%
- Support(Beer) = 3/5 = 60%
- Support(Eggs) = 1/5 = 20%
- Support(Cola) = 2/5 = 40%
Step 2: Prune itemsets with support less than 50%. Eggs and Cola are eliminated;
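Continuing with the assumed transactions above, steps 1 and 2 can be reproduced in a few lines (variable names are ours):

```python
from collections import Counter

min_support = 0.5
counts = Counter(item for t in transactions for item in t)
supports = {item: c / len(transactions) for item, c in counts.items()}
frequent_1 = {item for item, s in supports.items() if s >= min_support}
# frequent_1 == {"Bread", "Milk", "Diaper", "Beer"}; Eggs and Cola are dropped.
```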
Step 3: Generate candidate 2-itemsets from the remaining items (Bread, Milk, Diaper, Beer);
Step 4: Calculate support for 2-itemsets and prune;
Suppose we find that:
- Support(Bread, Milk) = 3/5 = 60%
- Support(Bread, Diaper) = 4/5 = 80%
- Support(Milk, Diaper) = 4/5 = 80%
- And so on for other combinations...
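Steps 3 and 4 follow the same pattern; continuing the snippet above:

```python
from itertools import combinations

# Candidate 2-itemsets are all pairs of frequent 1-itemsets.
candidates_2 = [frozenset(pair) for pair in combinations(sorted(frequent_1), 2)]
supports_2 = {
    c: sum(c <= t for t in transactions) / len(transactions)
    for c in candidates_2
}
frequent_2 = {c: s for c, s in supports_2.items() if s >= min_support}
# e.g. supports_2[frozenset({"Bread", "Milk"})] == 0.6
```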
Step 5: Repeat the process to generate 3-itemsets from the frequent 2-itemsets, and so on, until no new frequent itemsets are found. For example, from the frequent itemsets {Bread, Diaper} and {Bread, Milk} we can generate the 3-itemset {Bread, Milk, Diaper}, which has support 3/5 = 60%.
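Continuing the same snippet, the 3-itemset's support and the confidence of one resulting rule can be checked as follows (the confidence value is computed from the assumed data, not quoted from the course):

```python
triple = frozenset({"Bread", "Milk", "Diaper"})
support_3 = sum(triple <= t for t in transactions) / len(transactions)  # 0.6

# Example rule {Bread, Milk} -> {Diaper}:
# confidence = support({Bread, Milk, Diaper}) / support({Bread, Milk}) = 0.6 / 0.6 = 1.0
confidence = support_3 / supports_2[frozenset({"Bread", "Milk"})]
```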