Rule Pruning and Redundancy Removal
When building rule-based machine learning models, you often generate a large set of rules to cover as many patterns in the data as possible. However, not all rules are equally valuable. Some rules may overlap, repeat the same logic, or capture noise rather than meaningful patterns. This can lead to models that are unnecessarily complex, harder to interpret, and more likely to overfit the training data. Pruning and redundancy removal are essential steps to address these issues. For example, consider these two rules:
- If age > 30 and income > 50K then label = "yes";
- If income > 50K and age > 30 then label = "yes";
Both rules express the same logic, so one is redundant. Keeping both adds no value but increases complexity. Similarly, rules that rarely apply or have low predictive accuracy may not help generalization and can also be pruned.
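Before looking at the pruning code itself, it is worth seeing where a rule's coverage and accuracy numbers might come from. The sketch below is a minimal, illustrative way to compute them: it represents a rule's condition as a Python predicate and evaluates it against a tiny hand-made dataset. Both the predicate form and the toy records are assumptions made for this sketch, not part of the main example that follows.

```python
# Minimal sketch: estimating coverage and accuracy for one rule.
# The toy records and the predicate form are illustrative assumptions.
data = [
    {"age": 45, "income": 60_000, "label": "yes"},
    {"age": 35, "income": 55_000, "label": "yes"},
    {"age": 25, "income": 52_000, "label": "no"},
]

def rule_stats(condition, prediction, records):
    """Coverage = how many records the rule fires on;
    accuracy = fraction of those records it labels correctly."""
    covered = [r for r in records if condition(r)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for r in covered if r["label"] == prediction)
    return len(covered), correct / len(covered)

coverage, accuracy = rule_stats(
    lambda r: r["age"] > 30 and r["income"] > 50_000, "yes", data
)
print(coverage, accuracy)  # 2 1.0 on this toy data
```

The main example below then takes such (rule, coverage, accuracy) triples as its input.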
```python
# List of candidate rules as (rule, coverage, accuracy)
candidate_rules = [
    ("if age > 30 and income > 50K then label = 'yes'", 120, 0.95),
    ("if income > 50K and age > 30 then label = 'yes'", 120, 0.95),  # Redundant
    ("if age <= 30 then label = 'no'", 80, 0.80),
    ("if income < 20K then label = 'no'", 10, 0.30),  # Low quality
    ("if city == 'NY' then label = 'yes'", 15, 0.60),
]

def normalize(rule):
    """Canonical form: lowercase, drop spaces, and sort the conditions
    so that rules with the same logic in a different order match.
    (Simple heuristic: assumes conditions are joined by 'and' and no
    token contains the substrings 'and' or 'then'.)"""
    rule = rule.lower().replace(" ", "")
    head, _, tail = rule.partition("then")
    conditions = sorted(head.removeprefix("if").split("and"))
    return "if" + "and".join(conditions) + "then" + tail

def is_redundant(rule, rules_seen):
    return normalize(rule) in rules_seen

def prune_rules(rules, min_coverage=20, min_accuracy=0.7):
    pruned = []
    rules_seen = set()
    for rule, coverage, accuracy in rules:
        if coverage < min_coverage or accuracy < min_accuracy:
            continue  # Drop low-quality rules
        # Drop redundant rules (same logic, different condition order)
        if not is_redundant(rule, rules_seen):
            pruned.append((rule, coverage, accuracy))
            rules_seen.add(normalize(rule))
    return pruned

pruned_rules = prune_rules(candidate_rules)
for rule in pruned_rules:
    print(rule[0])
```
The pruning logic in the code above works in two main ways. First, it filters out rules whose coverage or accuracy falls below the chosen thresholds, so only high-quality rules remain. Second, it checks for redundancy by normalizing each rule string: lowercasing it, removing spaces, and sorting its conditions, so that logically identical rules whose conditions are written in a different order are recognized as duplicates. (Lowercasing and stripping spaces alone would not catch the reordered rule in the example, which is why the conditions are sorted into a canonical order.) By keeping only unique, high-quality rules, the resulting model is simpler and easier to interpret, and it generalizes better to new data because it avoids overfitting to noise or redundant patterns.
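As a quick sanity check, the snippet below reuses the `normalize`, `prune_rules`, and `candidate_rules` definitions from the code above. It confirms that the two reordered rules collapse to the same canonical form and shows how tightening the thresholds shrinks the rule set; the stricter threshold values here are purely illustrative, not recommended settings.

```python
# Assumes normalize(), prune_rules(), and candidate_rules from above.
r1 = "if age > 30 and income > 50K then label = 'yes'"
r2 = "if income > 50K and age > 30 then label = 'yes'"
print(normalize(r1) == normalize(r2))  # True: reordered conditions match

# Stricter, purely illustrative thresholds keep only the strongest rule.
strict = prune_rules(candidate_rules, min_coverage=50, min_accuracy=0.9)
for rule, coverage, accuracy in strict:
    print(f"{rule}  (coverage={coverage}, accuracy={accuracy})")
```

Raising `min_coverage` and `min_accuracy` trades recall for precision: fewer rules survive, but each one is better supported by the data.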