Feature Selection and Regularization Techniques

Filter Methods: SelectKBest

Filter methods are a family of feature selection techniques that evaluate the relevance of each feature independently of the predictive model. These methods use statistical measures to score each feature based on its relationship with the target variable. Univariate feature selection is a type of filter method in which each feature is evaluated individually with a univariate statistical test, making it a fast and scalable approach when you need to quickly reduce the dimensionality of a dataset before modeling.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
import pandas as pd

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=8, noise=0.2, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)

# Select top 3 features using f_regression (ANOVA F-value)
selector_f = SelectKBest(score_func=f_regression, k=3)
X_new_f = selector_f.fit_transform(X, y)
selected_features_f = [feature for feature, mask in zip(feature_names, selector_f.get_support()) if mask]

# Select top 3 features using mutual_info_regression
selector_mi = SelectKBest(score_func=mutual_info_regression, k=3)
X_new_mi = selector_mi.fit_transform(X, y)
selected_features_mi = [feature for feature, mask in zip(feature_names, selector_mi.get_support()) if mask]

print("Top 3 features by f_regression:", selected_features_f)
print("Top 3 features by mutual_info_regression:", selected_features_mi)
print("f_regression scores:", selector_f.scores_)
print("mutual_info_regression scores:", selector_mi.scores_)

When you use SelectKBest, each feature receives a score based on its statistical relationship with the target variable. For regression, f_regression computes the ANOVA F-value for each feature, measuring linear dependency, while mutual_info_regression estimates mutual information, which captures any dependency, not just linear ones. Higher scores indicate features that are more relevant for predicting the target. After fitting, you can inspect the .scores_ attribute to see how every feature scored. You typically keep the top k features with the highest scores, as shown above, and use them for further modeling. This process quickly identifies and retains only the most informative features, reducing noise and improving model efficiency.
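If you want the full ranking rather than only the selected subset, you can fit SelectKBest with k="all" so that every feature is scored but none is dropped. The sketch below is a minimal illustration that reuses X, y, and feature_names from the snippet above; the ranking table is our own construction, not part of the scikit-learn API:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Score every feature without dropping any
scorer = SelectKBest(score_func=f_regression, k="all").fit(X, y)

# Tabulate F-scores and p-values (f_regression-based selectors also expose .pvalues_)
ranking = pd.DataFrame({
    "feature": feature_names,
    "f_score": scorer.scores_,
    "p_value": scorer.pvalues_,
}).sort_values("f_score", ascending=False)
print(ranking)

Rather than picking k by inspection, you can also treat k as a hyperparameter and let cross-validation choose it. A hedged sketch, assuming a plain linear model as the downstream estimator; the grid of candidate k values is illustrative, not prescribed:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Chain selection and modeling so each CV fold refits the selector on its own training split
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])

# Candidate k values are illustrative; adjust for your dataset
search = GridSearchCV(pipe, param_grid={"select__k": [1, 2, 3, 5, 8]}, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["select__k"])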


When should you consider using filter methods like SelectKBest for feature selection?


