Feature Selection and Regularization Techniques

Filter Methods: SelectKBest

Filter methods are a family of feature selection techniques that evaluate the relevance of each feature independently of the predictive model. These methods use statistical measures to score each feature based on its relationship with the target variable. Univariate feature selection is a type of filter method in which each feature is evaluated individually with a univariate statistical test, making it a fast and scalable approach when you need to quickly reduce the dimensionality of a dataset before modeling.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
import pandas as pd

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=8, noise=0.2, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)

# Select top 3 features using f_regression (ANOVA F-value)
selector_f = SelectKBest(score_func=f_regression, k=3)
X_new_f = selector_f.fit_transform(X, y)
selected_features_f = [feature for feature, mask in zip(feature_names, selector_f.get_support()) if mask]

# Select top 3 features using mutual_info_regression
selector_mi = SelectKBest(score_func=mutual_info_regression, k=3)
X_new_mi = selector_mi.fit_transform(X, y)
selected_features_mi = [feature for feature, mask in zip(feature_names, selector_mi.get_support()) if mask]

print("Top 3 features by f_regression:", selected_features_f)
print("Top 3 features by mutual_info_regression:", selected_features_mi)
print("f_regression scores:", selector_f.scores_)
print("mutual_info_regression scores:", selector_mi.scores_)

When you use SelectKBest, each feature receives a score based on its statistical relationship with the target variable. For regression, f_regression computes the ANOVA F-value for each feature, measuring linear dependency, while mutual_info_regression estimates mutual information, which captures any dependency, not just linear ones. Higher scores indicate features that are more relevant for predicting the target. After fitting, you can inspect the .scores_ attribute to see how every feature scored. You typically keep the top k features with the highest scores, as shown above, and use them for further modeling. This process quickly identifies and retains only the most informative features, reducing noise and improving model efficiency.
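If you want the full ranking rather than only the selected subset, you can fit SelectKBest with k="all" so that every feature is scored but none is dropped. The sketch below is a minimal illustration that reuses X, y, and feature_names from the snippet above; the ranking table is our own construction, not part of the scikit-learn API:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Score every feature without dropping any
scorer = SelectKBest(score_func=f_regression, k="all").fit(X, y)

# Tabulate F-scores and p-values (f_regression-based selectors also expose .pvalues_)
ranking = pd.DataFrame({
    "feature": feature_names,
    "f_score": scorer.scores_,
    "p_value": scorer.pvalues_,
}).sort_values("f_score", ascending=False)
print(ranking)

Rather than picking k by inspection, you can also treat k as a hyperparameter and let cross-validation choose it. A hedged sketch, assuming a plain linear model as the downstream estimator; the grid of candidate k values is illustrative, not prescribed:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Chain selection and modeling so each CV fold refits the selector on its own training split
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])

# Candidate k values are illustrative; adjust for your dataset
search = GridSearchCV(pipe, param_grid={"select__k": [1, 2, 3, 5, 8]}, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["select__k"])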


When should you consider using filter methods like SelectKBest for feature selection?


