Learn Leave-One-Out Encoding: Principles and Application | Weight-of-Evidence and Leave-One-Out Encoding

Leave-one-out encoding transforms categorical variables into numerical values for supervised learning tasks. Each row is encoded with the mean of the target variable over all other rows that share its category, leaving out the row itself.

How Leave-One-Out Encoding Works

For each row, compute the mean target value for its category, excluding that row;
Assign this calculated mean as the encoded value for the category in that row.
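
The two steps above can be written as a single formula. The notation is introduced here for illustration and does not come from the lesson itself: y_i is the target of row i, c_i is its category, and n_{c_i} is the number of rows in that category.

```latex
\[
\mathrm{LOO}_i \;=\; \frac{\left(\sum_{j \,:\, c_j = c_i} y_j\right) - y_i}{n_{c_i} - 1},
\qquad n_{c_i} > 1 .
\]
```

For example, if a category contains three rows with targets 1, 0, 1, the first row is encoded as (2 - 1) / (3 - 1) = 0.5. When a category has only one row, the denominator is zero and the encoded value is undefined.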

This method reduces the risk of target leakage compared to simple mean encoding, which uses the mean target value for each category across the entire dataset—including the current row. By excluding the current observation, leave-one-out encoding prevents the encoded value from being influenced by the target of that specific sample, making the model less prone to overfitting.

Comparison to Mean Encoding

Mean encoding replaces each category with the mean of the target variable for that category, including the current row;
This can introduce information from the target into the feature set, especially during model training;
Leave-one-out encoding avoids this by using only information from other samples, creating a more realistic simulation of how the model will perform on new, unseen data.
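
The difference between the two encodings is easiest to see side by side. The following is a minimal sketch on a toy dataset invented for this illustration; the column names `color_mean` and `color_loo` are assumptions, not part of the lesson.

```python
import pandas as pd

# Toy data, chosen only for illustration.
df = pd.DataFrame({
    "color": ["red", "red", "red", "blue", "blue"],
    "target": [1, 0, 1, 0, 1],
})

grouped = df.groupby("color")["target"]

# Mean encoding: every row of a category gets the same value,
# and that value includes the row's own target.
df["color_mean"] = grouped.transform("mean")

# Leave-one-out encoding: remove the row's own target from the
# category sum and count before averaging.
loo_sum = grouped.transform("sum") - df["target"]
loo_count = grouped.transform("count") - 1
df["color_loo"] = loo_sum / loo_count

print(df)
```

With mean encoding, all three `red` rows share the value 2/3. With leave-one-out encoding, the `red` rows get 0.5, 1.0, and 0.5, because each row sees only the targets of the other two.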


import pandas as pd

def leave_one_out_encode(df, categorical_col, target_col):
    # Calculate sum and count for each category
    category_sum = df.groupby(categorical_col)[target_col].transform('sum')
    category_count = df.groupby(categorical_col)[target_col].transform('count')
    # Subtract the current row's target and count to leave one out
    loo_sum = category_sum - df[target_col]
    loo_count = category_count - 1
    # Avoid division by zero: a category with a single row has no other
    # rows to average, so its encoded value becomes NaN
    encoded = loo_sum / loo_count.where(loo_count > 0)
    return encoded

# Example usage:
data = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'green'],
    'target': [1, 0, 1, 0, 1, 1]
})

data['color_loo'] = leave_one_out_encode(data, 'color', 'target')
print(data)

Study More

Leave-one-out encoding reduces target leakage compared to standard mean encoding, but it is not without risks. With small datasets or rare categories, each encoded value is an average over only a few rows, so it can still be strongly influenced by individual target values and lead to overfitting. Regularization strategies, such as adding noise or smoothing, are often recommended to further reduce overfitting when using leave-one-out encoding.
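
One way the smoothing and noise ideas could be combined with leave-one-out encoding is sketched below. This is an illustration under stated assumptions, not a reference implementation: the function name `loo_encode_regularized` and the parameters `m` (a pseudo-count that pulls the estimate toward the global target mean), `noise_std`, and `seed` are all hypothetical.

```python
import numpy as np
import pandas as pd

def loo_encode_regularized(df, categorical_col, target_col,
                           m=1.0, noise_std=0.0, seed=0):
    """Leave-one-out encoding with additive smoothing and optional noise.

    m, noise_std, and seed are hypothetical parameters for this sketch.
    """
    grouped = df.groupby(categorical_col)[target_col]
    # Sum and count of the *other* rows in the same category
    loo_sum = grouped.transform("sum") - df[target_col]
    loo_count = grouped.transform("count") - 1

    # Prior: the global target mean. For simplicity it is computed over
    # all rows, so it still includes the current row's target.
    prior = df[target_col].mean()

    # Smoothing: blend the leave-one-out mean with the prior. With m > 0,
    # a category that has a single row falls back to the prior.
    encoded = (loo_sum + m * prior) / (loo_count + m)

    # Optional multiplicative Gaussian noise (intended for training rows)
    if noise_std > 0:
        rng = np.random.default_rng(seed)
        encoded = encoded * rng.normal(1.0, noise_std, size=len(df))
    return encoded

# Example usage on toy data; 'green' appears only once
data = pd.DataFrame({
    "color": ["red", "red", "green"],
    "target": [1, 0, 1],
})
data["color_loo_smooth"] = loo_encode_regularized(data, "color", "target", m=1.0)
print(data)
```

In this example the single `green` row receives the global mean of 2/3 instead of an undefined value. Larger values of `m` pull rare categories more strongly toward the global mean.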


Section 1. Chapter 3
