Implementing WoE Encoding in Python
To put Weight-of-Evidence (WoE) encoding into practice, you will work through applying it to a pandas DataFrame step by step. WoE encoding is especially useful for binary classification tasks, where you want to transform categorical variables into numeric values that reflect the strength and direction of their association with the target variable. You will first compute WoE values for each category in a column, then map those values back to the DataFrame.
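Concretely, for a category $c$, WoE compares the share of positive-class ("good") observations that fall in $c$ with the share of negative-class ("bad") observations that fall in $c$:

$$\mathrm{WoE}(c) = \ln\left(\frac{\mathrm{dist\_good}(c)}{\mathrm{dist\_bad}(c)}\right) = \ln\left(\frac{\mathrm{good}_c \,/\, \sum_k \mathrm{good}_k}{\mathrm{bad}_c \,/\, \sum_k \mathrm{bad}_k}\right)$$

A positive WoE means the category leans toward the positive class, a negative WoE means it leans toward the negative class, and a value near zero indicates little association. The code below adds a small constant to both ratios so the logarithm is never taken of zero.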
```python
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    "feature": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
    "target": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
})

# Calculate WoE values for each category in 'feature'
def calc_woe(df, feature, target):
    eps = 0.0001  # to avoid division by zero
    grouped = df.groupby(feature)[target]
    good = grouped.sum()
    bad = grouped.count() - good
    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log((dist_good + eps) / (dist_bad + eps))
    return woe

woe_map = calc_woe(df, "feature", "target")

# Map WoE values back to the DataFrame
df["feature_woe"] = df["feature"].map(woe_map)
print(df)
```
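To make the calculation concrete with the sample data above: category "A" holds 3 of the 5 positives and 1 of the 5 negatives, so dist_good = 3/5, dist_bad = 1/5, and WoE(A) = ln(0.6 / 0.2) ≈ 1.10 (the eps term shifts this only negligibly). Category "C" holds no positives at all, which is exactly where eps matters: without it, dist_good would be 0 and the logarithm undefined, so WoE(C) instead becomes a large negative number.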
```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Custom WoE encoder as a scikit-learn transformer
class WoEEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, feature):
        self.feature = feature
        self.woe_map_ = None

    def fit(self, X, y):
        df = pd.DataFrame({self.feature: X[self.feature], "target": y})
        self.woe_map_ = calc_woe(df, self.feature, "target")
        return self

    def transform(self, X):
        X = X.copy()
        X[self.feature + "_woe"] = X[self.feature].map(self.woe_map_)
        return X[[self.feature + "_woe"]]

# Sample pipeline using WoE-encoded features
X = df[["feature"]]
y = df["target"]

pipeline = Pipeline([
    ("woe_encoder", WoEEncoder(feature="feature")),
    ("logreg", LogisticRegression(solver="liblinear"))
])

pipeline.fit(X, y)
print("WoE coefficients:", pipeline.named_steps["logreg"].coef_)
```
Understanding the Custom WoEEncoder and Pipeline Integration
The custom WoEEncoder class is built to work seamlessly with scikit-learn's pipeline architecture. This allows you to integrate Weight-of-Evidence encoding directly into your modeling workflow, ensuring that feature encoding and model training happen together.
How WoEEncoder Works:
- Initialization:
  - When you create an instance of `WoEEncoder`, you specify the name of the feature to encode (e.g., `feature`).
- `fit` method:
  - The `fit` method takes your feature matrix `X` and target vector `y`.
  - It constructs a temporary DataFrame with the feature and target.
  - Using the `calc_woe` function, it calculates WoE values for each category in the feature column, based on their relationship to the target.
  - The resulting mapping (`woe_map_`) is stored as an instance variable for use during transformation.
- `transform` method:
  - The `transform` method creates a copy of your feature matrix.
  - It maps the WoE values to the feature column, creating a new column with the suffix `_woe` (e.g., `feature_woe`).
  - It returns only the WoE-encoded column as a DataFrame, which is then used for modeling (a minimal usage sketch follows this list).
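A minimal sketch of these steps in isolation, assuming `df`, `calc_woe`, and `WoEEncoder` are defined as in the code blocks above:

```python
# Fit the encoder on the feature column and target, then inspect
# the learned category -> WoE mapping before transforming.
encoder = WoEEncoder(feature="feature")
encoder.fit(df[["feature"]], df["target"])

print(encoder.woe_map_)                    # pandas Series indexed by category
print(encoder.transform(df[["feature"]]))  # a single "feature_woe" column
```

Inside a pipeline, these two calls happen automatically during fitting.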
Pipeline Integration:
- By including `WoEEncoder` as the first step in a scikit-learn `Pipeline`, you ensure that WoE encoding is always applied to new data in the same way as during training.
- The pipeline then passes the encoded data to a `LogisticRegression` model.
- When you call `pipeline.fit(X, y)`, the pipeline first applies WoE encoding to the feature, then trains the logistic regression model using the encoded values (see the prediction sketch after this list).
- This approach keeps your preprocessing and modeling tightly coupled, reducing the risk of data leakage and ensuring reproducibility.
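As a sketch of how the fitted pipeline would then be used on new rows (assuming every category was seen during training; an unseen one would map to NaN and break the model step, a caveat addressed after the summary):

```python
# Score new observations: the pipeline re-applies the stored WoE
# mapping before handing the encoded column to logistic regression.
new_rows = pd.DataFrame({"feature": ["A", "C"]})
print(pipeline.predict(new_rows))        # predicted classes
print(pipeline.predict_proba(new_rows))  # predicted probabilities
```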
Summary:
- The custom WoE encoder automates the calculation and application of WoE encoding.
- Integrating `WoEEncoder` in a pipeline allows you to build robust, production-ready models that encode categorical features based on their predictive power for the target variable.
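One production caveat worth noting: `Series.map` returns NaN for categories that were not seen during `fit`, and `LogisticRegression` cannot handle NaN inputs. A common convention is to fall back to a WoE of 0, meaning "no evidence either way." The subclass below is a hypothetical sketch of that convention (the name `SafeWoEEncoder` and the zero-fill choice are assumptions, not part of the original encoder):

```python
# Hypothetical variant: categories unseen during fit fall back to
# WoE = 0.0 ("no evidence"); another option is to pool rare categories
# into an "other" bucket before fitting.
class SafeWoEEncoder(WoEEncoder):
    def transform(self, X):
        X = X.copy()
        X[self.feature + "_woe"] = (
            X[self.feature].map(self.woe_map_).fillna(0.0)
        )
        return X[[self.feature + "_woe"]]
```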