Efficient Encoding Techniques
When you work with datasets containing categorical features that have a large number of unique values—such as product IDs, user names, or zip codes—traditional encoding approaches like one-hot encoding can lead to inefficient, high-dimensional representations. This is known as the curse of high cardinality. To address this, you can use efficient encoding techniques that reduce dimensionality and help your models generalize better. Two practical approaches for high-cardinality features are frequency encoding and hashing encoding.
Frequency encoding replaces each category with the frequency of its occurrence in the dataset. This method is simple and effective, and it preserves information about the distribution of categories without inflating the feature space. With pandas, you can easily implement frequency encoding on a categorical variable.
```python
import pandas as pd

# Sample data with a high-cardinality categorical feature
df = pd.DataFrame({
    "user_id": ["A123", "B456", "C789", "A123", "B456", "D012", "A123"]
})

# Compute frequency of each category
freq = df["user_id"].value_counts(normalize=True)

# Map frequencies to the original column
df["user_id_freq_encoded"] = df["user_id"].map(freq)

print(df)
```
In this example, each user_id is replaced by its relative frequency in the dataset, resulting in a single numerical feature that can be used by most machine learning algorithms. This approach is especially useful when the number of unique values is too large for one-hot encoding.
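In practice, frequencies should be computed on your training data only and then applied to new data, with a fallback for categories the training set never saw. A minimal sketch of this pattern, with illustrative column values:

```python
import pandas as pd

# Training data: learn category frequencies here
train = pd.DataFrame({"user_id": ["A123", "B456", "A123", "C789", "A123"]})
freq = train["user_id"].value_counts(normalize=True)

# New data may contain categories never seen in training (e.g. "Z999")
new_data = pd.DataFrame({"user_id": ["A123", "Z999", "B456"]})

# Map the learned frequencies; unseen categories become NaN, so fill with 0
new_data["user_id_freq_encoded"] = new_data["user_id"].map(freq).fillna(0.0)

print(new_data)
```

Filling unseen categories with 0 is one common choice; another is to fill with the smallest frequency observed in training.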
Another powerful method is hashing encoding, sometimes called the "hashing trick." Instead of mapping each category to a unique integer or frequency, you use a hash function to map categories into a fixed number of bins. This reduces dimensionality and allows you to handle previously unseen categories, but it may introduce collisions where different categories share the same bin. Hashing encoding is efficient and scalable for very large datasets.
```python
import pandas as pd

# Sample data with a high-cardinality categorical feature
df = pd.DataFrame({
    "product_code": ["X001", "Y002", "Z003", "X001", "W004", "Y002", "V005"]
})

# Define number of bins for hashing
num_bins = 4

# Simple hash function using Python's built-in hash and modulo
df["product_code_hash"] = df["product_code"].apply(lambda x: hash(x) % num_bins)

print(df)
```
Here, each product_code is mapped into one of four bins using the Python hash function and the modulo operator. This approach is memory efficient and particularly suitable when you expect new categories in future data.
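One caveat with the built-in `hash`: since Python 3.3 it is salted per process for strings, so the bin assignments above will differ between runs unless you fix `PYTHONHASHSEED`. For a reproducible encoding you can hash through `hashlib` instead; a minimal sketch (the helper name is illustrative):

```python
import hashlib

import pandas as pd

df = pd.DataFrame({
    "product_code": ["X001", "Y002", "Z003", "X001", "W004"]
})

num_bins = 4

def stable_hash_bin(value, num_bins):
    # MD5 produces the same digest in every process,
    # unlike the salted built-in hash() for strings
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

df["product_code_hash"] = df["product_code"].apply(
    lambda x: stable_hash_bin(x, num_bins)
)

print(df)
```

With a stable hash, the same category always lands in the same bin, even when the encoding is recomputed in a different process or on a different machine.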
By using frequency encoding and hashing encoding, you can efficiently handle high-cardinality categorical features and keep your feature space manageable while maintaining useful information for your models.
Another efficient technique for high-cardinality categorical features is label encoding. With label encoding, each unique category is assigned a unique integer label. This method is simple to implement and keeps the feature space compact, making it suitable for algorithms that can handle categorical values as integers.
However, label encoding introduces an ordinal relationship between categories, which means models may interpret the encoded values as having an inherent order. This can be problematic if your categorical variable does not have a natural ranking. Use label encoding when your algorithm can handle arbitrary integer labels or when the categorical variable is truly ordinal.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data with a categorical feature
colors = pd.Series(["red", "blue", "green", "blue", "red", "yellow", "green"])

# Initialize LabelEncoder and fit_transform the data
le = LabelEncoder()
labels = le.fit_transform(colors)

# Create a DataFrame to show the results
encoded_df = pd.DataFrame({
    "color": colors,
    "label_encoded": labels
})

print(encoded_df)
print("\nUnique categories:", list(le.classes_))
```