
Rule-Based Pipelines for Tabular ML Tasks

Building a machine learning pipeline for tabular data with rule-based models involves several interconnected stages, each of which plays a crucial role in keeping the model interpretable, effective, and robust. The main components of such a pipeline are preprocessing, rule generation, prediction, and evaluation. Preprocessing prepares the data by handling missing values, encoding categorical features, and scaling numerical values. Rule generation creates interpretable rules from the preprocessed data, often using algorithms like RIPPER or by mining frequent patterns. Prediction applies the generated rules to classify new data points (or, for regression tasks, to predict numeric targets). Finally, evaluation assesses the model's accuracy and generalization using metrics such as accuracy, precision, recall, or the confusion matrix.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.base import BaseEstimator, ClassifierMixin

# Custom simple rule-based classifier
class SimpleRuleBasedClassifier(BaseEstimator, ClassifierMixin):
    def fit(self, X, y):
        self.classes_ = y.unique()  # store all class labels
        self.rules_ = {}
        for col in X.columns:
            self.rules_[col] = {}
            for val in X[col].unique():
                # Rule: map each feature value to its most frequent target class
                most_common = y[X[col] == val].mode()[0]
                self.rules_[col][val] = most_common
        self.default_ = y.mode()[0]  # fallback when no rule applies
        return self

    def predict(self, X):
        preds = []
        for _, row in X.iterrows():
            votes = []
            for col in X.columns:
                val = row[col]
                if val in self.rules_[col]:
                    votes.append(self.rules_[col][val])
            # Majority vote among applicable rules, else the default class
            pred = max(set(votes), key=votes.count) if votes else self.default_
            preds.append(pred)
        return pd.Series(preds, index=X.index)

# Data
data = pd.DataFrame({
    "age": [22, 35, 58, 44, 25, 33, 60, 48],
    "income": ["low", "high", "high", "medium", "low", "medium", "high", "medium"],
    "owns_house": ["no", "yes", "yes", "no", "no", "yes", "yes", "no"],
    "target": ["no", "yes", "yes", "no", "no", "yes", "yes", "no"]
})

# Preprocess
le_income = LabelEncoder()
data["income"] = le_income.fit_transform(data["income"])
le_house = LabelEncoder()
data["owns_house"] = le_house.fit_transform(data["owns_house"])
scaler = StandardScaler()
data["age"] = scaler.fit_transform(data[["age"]])

# Split
X = data[["age", "income", "owns_house"]]
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train
rule_model = SimpleRuleBasedClassifier()
rule_model.fit(X_train, y_train)

# Predict
y_pred = rule_model.predict(X_test)

# Evaluation
all_labels = ["no", "yes"]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred, labels=all_labels))
print("\nClassification Report:\n", classification_report(y_test, y_pred, labels=all_labels, zero_division=0))

To understand how each stage contributes to the pipeline, start with preprocessing. This stage transforms raw tabular data into a format suitable for rule extraction. In the code, categorical features like income and owns_house are encoded as integers using LabelEncoder, while the age feature is standardized using StandardScaler. Encoding gives the rule learner discrete values to match against; keep in mind that exact-value rules built on a scaled continuous feature such as age rarely fire on unseen rows, because new values seldom repeat the training values exactly.
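
For intuition, here is the encoding step in isolation; a minimal sketch using the same category values as the example above. LabelEncoder sorts classes alphabetically, so the integer codes carry no meaningful order, which is harmless here because the rules match exact values rather than compare magnitudes.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["low", "high", "medium", "low"])
print(codes)                        # [1 0 2 1] -- classes are assigned in alphabetical order
print(le.classes_)                  # ['high' 'low' 'medium']
print(le.inverse_transform(codes))  # ['low' 'high' 'medium' 'low'] -- recover the original strings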

The rule generation stage follows, where the SimpleRuleBasedClassifier constructs a set of rules from the training data. For each feature, the classifier records the most frequent target class for each value, creating a simple but interpretable mapping. If a new sample matches known feature values, the classifier uses majority voting among applicable rules; otherwise, it falls back on the most common target class.
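
Because the learned rules are just a nested dictionary, they are easy to inspect. A small sketch, meant to be run after fitting rule_model in the script above:

# Print every learned rule plus the fallback class
for col, mapping in rule_model.rules_.items():
    for val, label in mapping.items():
        print(f"IF {col} == {val!r} THEN target = {label!r}")
print("DEFAULT target =", repr(rule_model.default_))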

Prediction applies these learned rules to unseen data. The classifier predicts the class for each test instance based on the rules generated during training. This approach maintains transparency, as each prediction can be traced back to specific feature-value associations.
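
To see this transparency in action, you can replay the voting logic for a single test row. A minimal sketch, run after the pipeline above; it mirrors the loop inside predict:

# Trace which rules fire for the first test instance
row = X_test.iloc[0]
votes = []
for col in X_test.columns:
    val = row[col]
    if val in rule_model.rules_[col]:
        votes.append((col, val, rule_model.rules_[col][val]))
print("Applicable rules:", votes)
labels = [label for _, _, label in votes]
print("Prediction:", max(set(labels), key=labels.count) if labels else rule_model.default_)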

Finally, the evaluation stage measures how well the pipeline performs. Metrics such as accuracy, the confusion matrix, and the classification report provide insight into the model's strengths and weaknesses. This feedback is essential for refining the rules, improving preprocessing, or adjusting the pipeline to better fit the data.
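
Because SimpleRuleBasedClassifier subclasses BaseEstimator and ClassifierMixin, it also plugs into scikit-learn's standard evaluation utilities. A minimal sketch, continuing from the script above; with only 8 samples the fold scores are noisy, so treat the numbers as illustrative:

from sklearn.model_selection import cross_val_score

# ClassifierMixin supplies an accuracy-based score(), so no custom scorer is needed
scores = cross_val_score(rule_model, X, y, cv=2)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())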

By structuring your workflow into these stages, you ensure that your rule-based ML pipeline is both interpretable and effective for tabular data tasks.
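
The classifier above is intentionally simple. For the RIPPER algorithm mentioned earlier, one option is the third-party wittgenstein library (pip install wittgenstein); the sketch below is hedged on that library's API, which may vary by version, and reuses the toy data with its original string categories, since RIPPER handles categorical features directly:

import pandas as pd
import wittgenstein as lw  # third-party package providing a RIPPER implementation

df = pd.DataFrame({
    "income": ["low", "high", "high", "medium", "low", "medium", "high", "medium"],
    "owns_house": ["no", "yes", "yes", "no", "no", "yes", "yes", "no"],
    "target": ["no", "yes", "yes", "no", "no", "yes", "yes", "no"]
})

ripper = lw.RIPPER()
ripper.fit(df, class_feat="target", pos_class="yes")
ripper.out_model()  # prints the learned ruleset, e.g. [[owns_house=yes]]
print(ripper.predict(df.drop(columns="target")))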

1. Which of the following is NOT a standard stage in a rule-based machine learning pipeline for tabular data?

2. When evaluating a rule-based pipeline for tabular classification, which metric would you use to measure the proportion of correctly predicted samples?

