Feature Selection Basics
Feature selection improves model performance by keeping only the most relevant features, reducing complexity, and helping prevent overfitting caused by irrelevant or redundant data.
Feature selection is the process of choosing a subset of input variables (features) from your data that are most relevant to the predictive modeling problem.
Feature selection methods range from manual review to automated techniques. In classification tasks, a common automated approach is to score features with statistical tests and select those most strongly related to the target variable.
The most popular feature selection methods fall into three categories:
- Filter methods: Select features based on statistical measures, such as correlation coefficients or univariate tests, independently of any machine learning model;
- Wrapper methods: Use a predictive model to evaluate different combinations of features, such as with recursive feature elimination (RFE), and select the subset that yields the best model performance;
- Embedded methods: Perform feature selection as part of the model training process, like Lasso regularization, which automatically removes less important features by shrinking their coefficients to zero.
Each method comes with trade-offs between computational cost, interpretability, and predictive power. The example below applies a filter method, SelectKBest, to the Titanic dataset; wrapper and embedded methods are sketched at the end of this section.
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder

# Load Titanic dataset
train = sns.load_dataset('titanic')

# Select numeric and categorical columns (excluding target)
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
X = train[features].copy()
y = train['survived']

# Encode categorical features
X['sex'] = LabelEncoder().fit_transform(X['sex'].astype(str))
X['embarked'] = LabelEncoder().fit_transform(X['embarked'].astype(str))

# Handle missing values by filling with median (for simplicity)
X = X.fillna(X.median(numeric_only=True))

# Select top 5 features based on ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))
In this example, you use SelectKBest from scikit-learn with the f_classif scoring function to select the five most relevant features—pclass, sex, parch, fare, and embarked—from the Titanic dataset. This method evaluates each feature individually using ANOVA F-values and selects those with the highest scores. It is effective for classification tasks because it focuses on features that best separate the target classes.
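To see how each feature scored, you can inspect the fitted selector's scores_ attribute. The short sketch below continues from the example above and assumes selector, X, and pd are still in scope:

# Pair each feature with its ANOVA F-score and sort from strongest to weakest
feature_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(feature_scores)

Features with higher F-scores separate the survived classes more strongly, which is exactly what SelectKBest uses to pick the top k.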
Selecting too many features, especially irrelevant ones, can lead to overfitting, where your model performs well on training data but poorly on new, unseen data. Careful feature selection helps to reduce this risk and leads to more robust models.
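One practical way to check this is to compare cross-validated accuracy with and without feature selection. The sketch below reuses X, y, and selected_features from the earlier example; the logistic regression estimator, max_iter=1000, and cv=5 are illustrative choices, not part of the original example:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Mean cross-validated accuracy with all seven features vs. the five selected ones
all_features_acc = cross_val_score(model, X, y, cv=5).mean()
selected_acc = cross_val_score(model, X[selected_features], y, cv=5).mean()

print("All features:     ", round(all_features_acc, 3))
print("Selected features:", round(selected_acc, 3))

If the selected subset scores about as well as, or better than, the full set, you have simplified the model without sacrificing generalization.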
Feature selection is not only about improving accuracy—it also makes your models faster and easier to interpret. By focusing only on the most important features, you simplify your models and reduce the chance of learning noise from the data.
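For comparison with the filter approach above, here are minimal sketches of a wrapper method (recursive feature elimination) and an embedded method. Because the target is binary, the embedded sketch uses L1-regularized logistic regression through SelectFromModel rather than a literal Lasso regressor; the estimators, C=0.1, and max_iter=1000 are illustrative assumptions. Both sketches reuse X and y from the earlier example:

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Wrapper method: repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print("RFE-selected features:", list(X.columns[rfe.support_]))

# Embedded method: the L1 penalty shrinks some coefficients to exactly zero during training,
# and SelectFromModel keeps only the features with nonzero coefficients
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
embedded = SelectFromModel(l1_model)
embedded.fit(X, y)
print("L1-selected features:", list(X.columns[embedded.get_support()]))

Note that the three approaches will not necessarily agree on the same subset; that disagreement is itself useful information about which features are robustly important.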