Rule Pruning and Redundancy Removal
When building rule-based machine learning models, you often generate a large set of rules to cover as many patterns in the data as possible. However, not all rules are equally valuable. Some rules may overlap, repeat the same logic, or capture noise rather than meaningful patterns. This can lead to models that are unnecessarily complex, harder to interpret, and more likely to overfit the training data. Pruning and redundancy removal are essential steps to address these issues. For example, consider these two rules:
- If age > 30 and income > 50K then label = "yes";
- If income > 50K and age > 30 then label = "yes";
Both rules express the same logic, so one is redundant. Keeping both adds no value but increases complexity. Similarly, rules that rarely apply or have low predictive accuracy may not help generalization and can also be pruned.
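Before looking at the pruning code itself, it is worth seeing where a rule's coverage and accuracy numbers might come from. The sketch below is a minimal, illustrative way to compute them: it represents a rule's condition as a Python predicate and evaluates it against a tiny hand-made dataset. Both the predicate form and the toy records are assumptions made for this sketch, not part of the main example that follows.

```python
# Minimal sketch: estimating coverage and accuracy for one rule.
# The toy records and the predicate form are illustrative assumptions.
data = [
    {"age": 45, "income": 60_000, "label": "yes"},
    {"age": 35, "income": 55_000, "label": "yes"},
    {"age": 25, "income": 52_000, "label": "no"},
]

def rule_stats(condition, prediction, records):
    """Coverage = how many records the rule fires on;
    accuracy = fraction of those records it labels correctly."""
    covered = [r for r in records if condition(r)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for r in covered if r["label"] == prediction)
    return len(covered), correct / len(covered)

coverage, accuracy = rule_stats(
    lambda r: r["age"] > 30 and r["income"] > 50_000, "yes", data
)
print(coverage, accuracy)  # 2 1.0 on this toy data
```

The main example below then takes such (rule, coverage, accuracy) triples as its input.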
```python
# List of candidate rules as (rule, coverage, accuracy)
candidate_rules = [
    ("if age > 30 and income > 50K then label = 'yes'", 120, 0.95),
    ("if income > 50K and age > 30 then label = 'yes'", 120, 0.95),  # Redundant
    ("if age <= 30 then label = 'no'", 80, 0.80),
    ("if income < 20K then label = 'no'", 10, 0.30),  # Low quality
    ("if city == 'NY' then label = 'yes'", 15, 0.60),
]

def normalize(rule):
    """Canonical form: lowercase, drop spaces, and sort the conditions
    so that rules with the same logic in a different order match.
    (Simple heuristic: assumes conditions are joined by 'and' and no
    token contains the substrings 'and' or 'then'.)"""
    rule = rule.lower().replace(" ", "")
    head, _, tail = rule.partition("then")
    conditions = sorted(head.removeprefix("if").split("and"))
    return "if" + "and".join(conditions) + "then" + tail

def is_redundant(rule, rules_seen):
    return normalize(rule) in rules_seen

def prune_rules(rules, min_coverage=20, min_accuracy=0.7):
    pruned = []
    rules_seen = set()
    for rule, coverage, accuracy in rules:
        if coverage < min_coverage or accuracy < min_accuracy:
            continue  # Drop low-quality rules
        # Drop redundant rules (same logic, different condition order)
        if not is_redundant(rule, rules_seen):
            pruned.append((rule, coverage, accuracy))
            rules_seen.add(normalize(rule))
    return pruned

pruned_rules = prune_rules(candidate_rules)
for rule in pruned_rules:
    print(rule[0])
```
The pruning logic in the code above works in two main ways. First, it filters out rules whose coverage or accuracy falls below the chosen thresholds, so only high-quality rules remain. Second, it checks for redundancy by normalizing each rule string: lowercasing it, removing spaces, and sorting its conditions, so that logically identical rules whose conditions are written in a different order are recognized as duplicates. (Lowercasing and stripping spaces alone would not catch the reordered rule in the example, which is why the conditions are sorted into a canonical order.) By keeping only unique, high-quality rules, the resulting model is simpler and easier to interpret, and it generalizes better to new data because it avoids overfitting to noise or redundant patterns.
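As a quick sanity check, the snippet below reuses the `normalize`, `prune_rules`, and `candidate_rules` definitions from the code above. It confirms that the two reordered rules collapse to the same canonical form and shows how tightening the thresholds shrinks the rule set; the stricter threshold values here are purely illustrative, not recommended settings.

```python
# Assumes normalize(), prune_rules(), and candidate_rules from above.
r1 = "if age > 30 and income > 50K then label = 'yes'"
r2 = "if income > 50K and age > 30 then label = 'yes'"
print(normalize(r1) == normalize(r2))  # True: reordered conditions match

# Stricter, purely illustrative thresholds keep only the strongest rule.
strict = prune_rules(candidate_rules, min_coverage=50, min_accuracy=0.9)
for rule, coverage, accuracy in strict:
    print(f"{rule}  (coverage={coverage}, accuracy={accuracy})")
```

Raising `min_coverage` and `min_accuracy` trades recall for precision: fewer rules survive, but each one is better supported by the data.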