Association Rule Mining
Exploratory Data Analysis
We have already discussed how association rule mining algorithms such as Apriori and FP-Growth are applied in market basket analysis. However, ARM can also address more specialized tasks, and this lesson gives a concise overview of one of them.
Association Rule Mining (ARM) can be utilized in classification and regression tasks to augment the exploratory data analysis (EDA) process and uncover latent patterns or relationships within our feature dataset.
By employing ARM, we can identify associations or "if-then" relationships among variables, which can be valuable for making predictions or deriving insights from the data.
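To make these "if-then" relationships concrete, here is a minimal sketch (with hypothetical toy data) of the three metrics an ARM rule is judged by: support, confidence, and lift. The feature names `A` and `B` are placeholders, not columns from the dataset used below.

```python
# Toy illustration: how strongly does the rule "if A then B" hold?
# Each row is one observation; 1 means the binary feature is present.
rows = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 1},
    {"A": 1, "B": 0},
    {"A": 0, "B": 1},
    {"A": 0, "B": 0},
]

n = len(rows)
support_a = sum(r["A"] for r in rows) / n               # P(A)
support_b = sum(r["B"] for r in rows) / n               # P(B)
support_ab = sum(r["A"] and r["B"] for r in rows) / n   # P(A and B)

confidence = support_ab / support_a   # P(B | A)
lift = confidence / support_b         # > 1 means a positive association

print(f"support={support_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.40, confidence=0.67, lift=1.11
```

Libraries like `mlxtend` compute exactly these quantities, but over every frequent itemset at once.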
Example
Let's consider a Heart Disease Classification dataset: it contains medical measurements for a set of patients. We will apply ARM to detect hidden patterns in it:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import warnings

# Ignore all warnings
warnings.filterwarnings('ignore')

# Load the heart dataset
df = pd.read_csv('https://codefinity-content-media-v2.s3.eu-west-1.amazonaws.com/courses/a7e17f02-2cc9-4b92-abe0-cc8710d7011e/heart.csv')

# Select features from the DataFrame
selected_features = ['sex', 'cp', 'restecg', 'slope', 'ca', 'thal', 'fbs', 'target']

# Create a new DataFrame containing only the selected features
df_selected = df[selected_features]

# One-hot encode the multi-category variables with sklearn's OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = ['cp', 'restecg', 'slope', 'thal', 'ca']
df_encoded_cols = pd.DataFrame(
    encoder.fit_transform(df[encoded_cols]),
    columns=encoder.get_feature_names_out(encoded_cols)
)

# Drop the original columns and replace them with the encoded ones
df_encoded = df_selected.drop(columns=encoded_cols)
df_encoded = pd.concat([df_encoded, df_encoded_cols], axis=1)

# Mine frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(df_encoded, min_support=0.2, use_colnames=True)

# Generate association rules (stored in `rules` so we don't shadow
# the imported association_rules function)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.8)

# Print antecedent, consequent, confidence, and lift metrics
print('Association Rules:')
print(rules[['antecedents', 'consequents', 'confidence', 'lift']])
```
What conclusions can we draw?
- If a patient has thalassemia type 3 (`thal_3`), they are likely to be male (`sex`) with a confidence of 87.07%. This suggests a strong association between `thal_3` and being male;
- If a patient has both slope type 2 (`slope_2`) and restecg type 1 (`restecg_1`), they are likely to have heart disease (`target`) with a confidence of 80.36%. This indicates a strong association between `slope_2`, `restecg_1`, and having heart disease;
- If a patient has both thalassemia type 2 (`thal_2`) and restecg type 1 (`restecg_1`), they are likely to have heart disease (`target`) with a confidence of 84.75%. This suggests a strong association between `thal_2`, `restecg_1`, and having heart disease;
- If a patient has both slope type 2 (`slope_2`) and thalassemia type 2 (`thal_2`), they are likely to have heart disease (`target`) with a confidence of 85.45%. This indicates a strong association between `slope_2`, `thal_2`, and having heart disease;
- All lift values are greater than 1 for these rules. This indicates that the antecedents and consequents occur together more frequently than expected if they were independent. In other words, the occurrence of the antecedents increases the likelihood of the consequents, suggesting a positive association between the variables.
Using rules 2-4, we can even perform rule-based classification: if a patient has those particular feature values, we can predict heart disease without training an ML model.
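Such a rule-based classifier can be sketched as follows. This is an illustrative sketch only, not a validated clinical model: the column names come from the one-hot encoding above, and the function simply hard-codes the mined rules, abstaining when none of them fires.

```python
def rule_based_predict(patient):
    """Predict heart disease (1) from the mined rules; return None when no rule fires.

    `patient` is a dict of one-hot feature values,
    e.g. {"slope_2": 1, "restecg_1": 1, "thal_2": 0}.
    """
    if patient.get("slope_2") and patient.get("restecg_1"):
        return 1  # rule: slope_2 & restecg_1 -> target, confidence ~80.36%
    if patient.get("thal_2") and patient.get("restecg_1"):
        return 1  # rule: thal_2 & restecg_1 -> target, confidence ~84.75%
    if patient.get("slope_2") and patient.get("thal_2"):
        return 1  # rule: slope_2 & thal_2 -> target, confidence ~85.45%
    return None  # no rule fires: fall back to another model or abstain

print(rule_based_predict({"slope_2": 1, "restecg_1": 1}))  # 1
print(rule_based_predict({"slope_2": 0, "restecg_1": 1}))  # None
```

Note that the confidences cap how much we should trust such predictions: roughly one patient in five or six matching a rule will not have the disease, so in practice this would complement, not replace, a trained classifier.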