Feature Selection Basics

Feature selection improves model performance by keeping only the most relevant features, reducing complexity, and helping prevent overfitting caused by irrelevant or redundant data.

Definition

Feature selection is the process of choosing a subset of input variables (features) from your data that are most relevant to the predictive modeling problem.

Feature selection methods range from manual review to automated techniques. In classification tasks, a common approach is to score each feature with a statistical test and select those most strongly related to the target variable.

The most popular feature selection methods fall into three categories:

  • Filter methods: Select features based on statistical measures, such as correlation coefficients or univariate tests, independently of any machine learning model;
  • Wrapper methods: Use a predictive model to evaluate different combinations of features, such as with recursive feature elimination (RFE), and select the subset that yields the best model performance;
  • Embedded methods: Perform feature selection as part of the model training process, like Lasso regularization, which automatically removes less important features by shrinking their coefficients to zero.

Each method balances trade-offs between computational cost, interpretability, and predictive power.
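
For instance, an embedded method along the lines described above might look like the following sketch (assuming scikit-learn; the synthetic dataset and the alpha value are illustrative choices, not part of this lesson's example):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

# Lasso shrinks unhelpful coefficients toward zero; SelectFromModel keeps
# only the features whose coefficients remain non-zero
embedded = SelectFromModel(Lasso(alpha=1.0))
X_reduced = embedded.fit_transform(X_demo, y_demo)

print("Kept", X_reduced.shape[1], "of", X_demo.shape[1], "features")

The rest of this chapter focuses on a filter method: scoring each feature independently with a univariate statistical test.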

import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder

# Load Titanic dataset
train = sns.load_dataset('titanic')

# Select numeric and categorical columns (excluding target)
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
X = train[features].copy()
y = train['survived']

# Encode categorical features
X['sex'] = LabelEncoder().fit_transform(X['sex'].astype(str))
X['embarked'] = LabelEncoder().fit_transform(X['embarked'].astype(str))

# Handle missing values by filling with median (for simplicity)
X = X.fillna(X.median(numeric_only=True))

# Select top 5 features based on ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))

In this example, you use SelectKBest from scikit-learn with the f_classif scoring function to select the five most relevant features—pclass, sex, parch, fare, and embarked—from the Titanic dataset. This method evaluates each feature individually using ANOVA F-values and selects those with the highest scores. It is effective for classification tasks because it focuses on features that best separate the target classes.
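
If you want to see why these particular features were chosen, the fitted selector exposes the scores it computed. The short sketch below reuses selector and X from the example above; scores_ and pvalues_ are standard attributes of a fitted SelectKBest.

# Rank features by their ANOVA F-value (higher = stronger relationship with the target)
scores = pd.DataFrame({
    'feature': X.columns,
    'f_value': selector.scores_,
    'p_value': selector.pvalues_,
}).sort_values('f_value', ascending=False)
print(scores)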

Note

Selecting too many features, especially irrelevant ones, can lead to overfitting, where your model performs well on training data but poorly on new, unseen data. Careful feature selection helps to reduce this risk and leads to more robust models.

Feature selection is not only about improving accuracy—it also makes your models faster and easier to interpret. By focusing only on the most important features, you simplify your models and reduce the chance of learning noise from the data.
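
As a rough comparison, a wrapper method such as RFE wraps an actual model around the search: it fits the model, drops the weakest feature, and repeats. The sketch below applies RFE to the X and y prepared earlier; the choice of LogisticRegression and of five features is illustrative, not part of the lesson's example.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate features until only five remain,
# using the logistic regression coefficients to rank importance
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("RFE-selected features:", list(X.columns[rfe.support_]))

Because RFE retrains the model at every step, it is more expensive than a filter method, but it can capture interactions between features that univariate tests miss.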

