Feature Selection Basics

Feature selection improves model performance by keeping only the most relevant features, reducing complexity, and helping prevent overfitting caused by irrelevant or redundant data.

Definition

Feature selection is the process of choosing a subset of input variables (features) from your data that are most relevant to the predictive modeling problem.

Feature selection methods range from manual review to automated techniques. In classification tasks, a common approach is to score each feature with a statistical test and select those most strongly related to the target variable.

The most popular feature selection methods fall into three categories:

  • Filter methods: Select features based on statistical measures, such as correlation coefficients or univariate tests, independently of any machine learning model;
  • Wrapper methods: Use a predictive model to evaluate different combinations of features, such as with recursive feature elimination (RFE), and select the subset that yields the best model performance;
  • Embedded methods: Perform feature selection as part of the model training process, like Lasso regularization, which automatically removes less important features by shrinking their coefficients to zero.

Each method balances trade-offs between computational cost, interpretability, and predictive power.
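
For instance, an embedded method along the lines described above might look like the following sketch (assuming scikit-learn; the synthetic dataset and the alpha value are illustrative choices, not part of this lesson's example):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

# Lasso shrinks unhelpful coefficients toward zero; SelectFromModel keeps
# only the features whose coefficients remain non-zero
embedded = SelectFromModel(Lasso(alpha=1.0))
X_reduced = embedded.fit_transform(X_demo, y_demo)

print("Kept", X_reduced.shape[1], "of", X_demo.shape[1], "features")

The rest of this chapter focuses on a filter method: scoring each feature independently with a univariate statistical test.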

import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder

# Load Titanic dataset
train = sns.load_dataset('titanic')

# Select numeric and categorical columns (excluding target)
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
X = train[features].copy()
y = train['survived']

# Encode categorical features
X['sex'] = LabelEncoder().fit_transform(X['sex'].astype(str))
X['embarked'] = LabelEncoder().fit_transform(X['embarked'].astype(str))

# Handle missing values by filling with median (for simplicity)
X = X.fillna(X.median(numeric_only=True))

# Select top 5 features based on ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))

In this example, you use SelectKBest from scikit-learn with the f_classif scoring function to select the five most relevant features—pclass, sex, parch, fare, and embarked—from the Titanic dataset. This method evaluates each feature individually using ANOVA F-values and selects those with the highest scores. It is effective for classification tasks because it focuses on features that best separate the target classes.
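
If you want to see why these particular features were chosen, the fitted selector exposes the scores it computed. The short sketch below reuses selector and X from the example above; scores_ and pvalues_ are standard attributes of a fitted SelectKBest.

# Rank features by their ANOVA F-value (higher = stronger relationship with the target)
scores = pd.DataFrame({
    'feature': X.columns,
    'f_value': selector.scores_,
    'p_value': selector.pvalues_,
}).sort_values('f_value', ascending=False)
print(scores)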

Note

Selecting too many features, especially irrelevant ones, can lead to overfitting, where your model performs well on training data but poorly on new, unseen data. Careful feature selection helps to reduce this risk and leads to more robust models.

Feature selection is not only about improving accuracy—it also makes your models faster and easier to interpret. By focusing only on the most important features, you simplify your models and reduce the chance of learning noise from the data.
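
As a rough comparison, a wrapper method such as RFE wraps an actual model around the search: it fits the model, drops the weakest feature, and repeats. The sketch below applies RFE to the X and y prepared earlier; the choice of LogisticRegression and of five features is illustrative, not part of the lesson's example.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate features until only five remain,
# using the logistic regression coefficients to rank importance
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("RFE-selected features:", list(X.columns[rfe.support_]))

Because RFE retrains the model at every step, it is more expensive than a filter method, but it can capture interactions between features that univariate tests miss.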

