
RIPPER Algorithm: Growing and Pruning Rules

The RIPPER algorithm (Repeated Incremental Pruning to Produce Error Reduction) is a well-known rule learning technique designed for efficient and interpretable classification. It is particularly effective for datasets with many features or imbalanced classes. RIPPER operates in three main phases: rule growing, rule pruning, and rule optimization.

In the rule growing phase, the algorithm incrementally adds conditions (also called literals) to a rule, aiming to cover as many positive examples as possible while excluding negatives. This greedy process continues until the rule perfectly fits a subset of the data or cannot be improved further.
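Cohen's RIPPER scores candidate literals during growing with FOIL's information gain, which rewards a literal for keeping many positives while making the rule's coverage purer (the simplified example later in this chapter uses a plain positives-minus-negatives gain instead). A minimal sketch of the criterion; the function name and example counts are illustrative:

import math

def foil_gain(p0, n0, p1, n1):
    # FOIL information gain for adding one literal to a rule.
    # p0, n0: positives/negatives covered before the literal is added;
    # p1, n1: positives/negatives covered after. Higher is better.
    if p1 == 0:
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

# A literal that keeps 4 of 5 positives and drops all 5 negatives:
print(foil_gain(p0=5, n0=5, p1=4, n1=0))  # 4 * (0 - (-1)) = 4.0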

Next, the pruning phase refines each grown rule to avoid overfitting. The algorithm deletes conditions from the rule when doing so improves performance on a separate pruning set held out from the training data; RIPPER scores candidate prunings with the metric (p - n) / (p + n), where p and n are the numbers of positive and negative examples the rule covers on that set. This step helps the rules generalize to unseen data.
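A notable detail is that RIPPER does not consider deleting arbitrary conditions: it only tries dropping a final sequence of conditions, i.e. truncating the rule from the end. A sketch of this truncation-style pruning, where coverage_counts is a hypothetical helper returning the (positives, negatives) a condition list covers on the pruning set:

def prune_value(p, n):
    # RIPPER's pruning metric on the held-out pruning set:
    # covered positives minus covered negatives, over total coverage.
    return (p - n) / (p + n) if (p + n) > 0 else -1.0

def prune_final_sequence(conditions, coverage_counts):
    # Greedily truncate from the end: keep a shorter prefix whenever its
    # pruning-set score is at least as good as the best seen so far.
    # coverage_counts(conds) -> (p, n) is a hypothetical helper.
    best, best_v = conditions, prune_value(*coverage_counts(conditions))
    for k in range(len(conditions) - 1, 0, -1):  # always keep >= 1 condition
        candidate = conditions[:k]
        v = prune_value(*coverage_counts(candidate))
        if v >= best_v:
            best, best_v = candidate, v
    return best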

Finally, the optimization phase reviews the entire rule set, attempting to reorder, revise, or replace rules to further reduce errors on the training set. This can involve re-growing and pruning rules or discarding ineffective ones.
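In Cohen's formulation, this pass builds two variants of each rule: a replacement, regrown from scratch, and a revision, grown by extending the existing rule; the winner is chosen with a minimum description length (MDL) criterion. A schematic of one pass, where make_replacement, make_revision, and error are assumed callables, and plain rule-set error stands in for MDL:

def optimize_rule_set(rules, make_replacement, make_revision, error):
    # One optimization pass (simplified). For each rule, compare the
    # original against a regrown replacement and an extended revision,
    # keeping whichever variant gives the whole rule set the lowest error.
    # Full RIPPER decides among the variants with an MDL criterion.
    for i, rule in enumerate(rules):
        variants = [rule, make_replacement(rule), make_revision(rule)]
        rules[i] = min(
            variants,
            key=lambda r: error(rules[:i] + [r] + rules[i + 1:]),
        )
    return rules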

The key steps of RIPPER can be summarized as:

  1. Separate the data into positive and negative classes;
  2. Grow a rule by greedily adding conditions to cover positive examples and exclude negatives;
  3. Prune the rule to maximize generalization;
  4. Add the pruned rule to the rule set and remove covered examples;
  5. Repeat the process until no positive examples remain;
  6. Optimize the rule set by revisiting and potentially improving individual rules.
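Steps 2 through 5 form a sequential covering loop. A compact sketch of that loop, reusing the grow_rule and prune_rule functions defined in the full example that follows (the covers helper is added here for the removal step):

import pandas as pd

def covers(df, conditions):
    # Boolean mask of the rows matched by every condition in the rule
    mask = pd.Series(True, index=df.index)
    for feature, op, threshold in conditions:
        if op == "<=":
            mask &= df[feature] <= threshold
        else:
            mask &= df[feature] > threshold
    return mask

def ripper_covering_loop(df, target_col):
    # Grow and prune one rule at a time, removing the examples it covers,
    # until no positive examples remain (or no useful rule can be grown).
    rule_set = []
    remaining = df.copy()
    while remaining[target_col].sum() > 0:
        conditions, _ = grow_rule(remaining, target_col)
        conditions = prune_rule(remaining, conditions, target_col)
        if not conditions:
            break
        rule_set.append(conditions)
        remaining = remaining[~covers(remaining, conditions)]
    return rule_set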
The full example below implements a single grow-and-prune cycle on a small dataset:

import pandas as pd

# Sample dataset: simple binary classification
data = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 55, 60],
    "income": [25000, 40000, 43000, 60000, 52000, 80000, 62000, 100000],
    "student": [0, 1, 0, 0, 1, 0, 1, 0],
    "label": [0, 0, 1, 1, 1, 1, 1, 1]
})

def grow_rule(df, target_col):
    # Start with no conditions
    conditions = []
    remaining = df.copy()
    # Try adding conditions greedily
    while True:
        best_gain = 0
        best_cond = None
        best_subset = None
        for feature in ["age", "income", "student"]:
            if feature in [c[0] for c in conditions]:
                continue
            for threshold in sorted(remaining[feature].unique()):
                for op in ["<=", ">"]:
                    if op == "<=":
                        subset = remaining[remaining[feature] <= threshold]
                    else:
                        subset = remaining[remaining[feature] > threshold]
                    if subset.empty:
                        continue
                    # Rule must improve coverage of positives and reduce negatives
                    pos = subset[target_col].sum()
                    neg = len(subset) - pos
                    if pos == 0:
                        continue
                    gain = pos - neg
                    if gain > best_gain:
                        best_gain = gain
                        best_cond = (feature, op, threshold)
                        best_subset = subset
        if best_cond:
            conditions.append(best_cond)
            remaining = best_subset
            # Stop if only positives left
            if remaining[target_col].sum() == len(remaining):
                break
        else:
            break
    return conditions, remaining

def prune_rule(df, conditions, target_col):
    # Start with the full rule, try removing one condition at a time
    best_conditions = conditions.copy()
    best_score = score_rule(df, best_conditions, target_col)
    for i in range(len(conditions)):
        pruned = best_conditions[:i] + best_conditions[i + 1:]
        score = score_rule(df, pruned, target_col)
        if score >= best_score:
            best_score = score
            best_conditions = pruned
    return best_conditions

def score_rule(df, conditions, target_col):
    if not conditions:
        return 0
    # Align the mask with df's index so filtered frames also score correctly
    mask = pd.Series(True, index=df.index)
    for feature, op, threshold in conditions:
        if op == "<=":
            mask &= df[feature] <= threshold
        else:
            mask &= df[feature] > threshold
    covered = df[mask]
    if covered.empty:
        return 0
    pos = covered[target_col].sum()
    # Use precision (fraction of covered examples that are positive) as score
    return pos / len(covered)

# Grow and prune one rule
conditions, covered = grow_rule(data, "label")
pruned_conditions = prune_rule(data, conditions, "label")
print("Grown rule conditions:", conditions)
print("Pruned rule conditions:", pruned_conditions)

The code above demonstrates how RIPPER's main steps are applied in practice. The grow_rule function incrementally adds conditions to build a rule that covers as many positive examples as possible while excluding negatives. At each step, it searches for the feature, operator, and threshold that maximize the difference between covered positives and covered negatives, reflecting the greedy growing phase of RIPPER.

After growing a rule, the prune_rule function attempts to remove conditions if doing so maintains or improves the rule's precision. This mirrors RIPPER's pruning phase, where the goal is to simplify the rule without sacrificing accuracy. Pruning is essential because rules that are too specific may fit the training data perfectly but fail to generalize to new, unseen data. By pruning unnecessary conditions, you reduce the risk of overfitting and encourage the discovery of more robust, interpretable rules.
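One simplification in the example: the rule is pruned on the same data it was grown on. Full RIPPER holds out part of the training data as a pruning set, growing on roughly two-thirds of the examples and pruning on the remaining third. A sketch of wiring that split into the functions above; the shuffle seed is arbitrary:

# Grow on ~2/3 of the data and prune on the held-out ~1/3, as RIPPER does.
shuffled = data.sample(frac=1.0, random_state=0).reset_index(drop=True)
split = int(len(shuffled) * 2 / 3)
grow_set = shuffled.iloc[:split]
prune_set = shuffled.iloc[split:]

conditions, _ = grow_rule(grow_set, "label")
pruned_conditions = prune_rule(prune_set, conditions, "label")
print("Rule grown on grow_set:", conditions)
print("Rule pruned on prune_set:", pruned_conditions)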
