Supervised Learning Essentials

Choosing the Features

When working with many features, you often don’t know which ones matter. You can train a model with all of them, check which features are unhelpful, and then retrain using only the impactful ones.

Why Remove Features from the Model?

Adding a feature that is unrelated to the target introduces noise and worsens predictions; many useless features compound the noise and degrade model quality even further.
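As a minimal sketch of this effect (using synthetic data rather than the lesson's dataset, so the numbers and column counts are illustrative assumptions), we can compare the cross-validated R² of a model trained on one informative feature against the same feature plus many columns of pure noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 100

# One informative feature plus Gaussian noise in the target
x = rng.normal(size=(n, 1))
y = 2 * x[:, 0] + rng.normal(scale=0.5, size=n)

# 40 columns of pure noise, completely unrelated to the target
noise = rng.normal(size=(n, 40))
X_clean = x
X_noisy = np.hstack([x, noise])

score_clean = cross_val_score(LinearRegression(), X_clean, y, cv=5).mean()
score_noisy = cross_val_score(LinearRegression(), X_noisy, y, cv=5).mean()

print(f'R2 with informative feature only: {score_clean:.3f}')
print(f'R2 with 40 noise features added:  {score_noisy:.3f}')
```

With 40 junk columns and only 100 rows, the model starts fitting the noise, and the cross-validated score typically drops.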

How to Know if the Features Are Good or Bad?

To evaluate whether features significantly affect the target, we can compute the p-values for each feature. A low p-value suggests that the feature is statistically significant.

```python
import pandas as pd
from sklearn.feature_selection import f_regression

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/heights_two_feature.csv'
df = pd.read_csv(file_link)
X, y = df[['Father', 'Mother']], df['Height']

# f_regression returns the F-statistic and p-value for each feature
f_stat, p_values = f_regression(X, y)

# Create a DataFrame to view the results nicely
results = pd.DataFrame({
    'Feature': X.columns,
    'p_value': p_values
})
print(results)
```

In short, the lower the p-value, the higher the confidence that the feature is impactful. Typically, a p-value less than 0.05 is considered statistically significant.

In the example above:

  • Father: the p-value is extremely small, so the feature is highly significant;
  • Mother: the p-value is very small, so the feature is also highly significant.

Both features are good predictors for the target.

In statistics, we set a significance level, usually 0.05. If a feature’s p-value exceeds this threshold, it is considered not impactful.

In practice, slightly higher p-values (just above 0.05) may still help the model. It is safer to test the model with and without such a feature. But if the p-value is very high (>0.4), you can remove it confidently.
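As a sketch of this threshold-based filtering (on synthetic data; the column names and the 0.05 cutoff are illustrative assumptions), we can keep only the columns whose p-value falls below the chosen significance level:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
n = 500

# One feature that truly drives the target and two that do not
X = pd.DataFrame({
    'signal': rng.normal(size=n),
    'junk_a': rng.normal(size=n),
    'junk_b': rng.normal(size=n),
})
y = 3 * X['signal'] + rng.normal(size=n)

_, p_values = f_regression(X, y)

# Keep only the features below the significance level
alpha = 0.05
keep = p_values < alpha
X_selected = X.loc[:, keep]

print(pd.DataFrame({'Feature': X.columns, 'p_value': p_values}))
print('Kept:', list(X_selected.columns))
```

For a feature near the threshold, remember the advice above: compare the model's quality with and without it before discarding it for good.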

Note

The p-value ranges from 0 to 1, so a low p-value means less than 0.05, while a high p-value usually means greater than 0.3–0.5.

In our example, the p-values for the Mother feature and the constant are 0.087 and 0.051. If we remove the features with a p-value > 0.05, we get the result below (on the left).

Even visually, we can tell that the model with the constant (on the right) is better, so it's better not to remove it from the model.

Note

Small datasets often produce higher p-values (0.05–0.2) even for meaningful features. P-values reflect confidence: with more data, it becomes easier to distinguish truly impactful features from noisy ones.
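A quick sketch of this effect (synthetic data; the weak effect size of 0.3 and the sample sizes are arbitrary assumptions): the same weak but real relationship that looks insignificant at n = 30 becomes clearly significant at n = 3000:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(7)

def weak_feature_p_value(n):
    # A weak but real relationship: y = 0.3 * x + noise
    x = rng.normal(size=(n, 1))
    y = 0.3 * x[:, 0] + rng.normal(size=n)
    _, p = f_regression(x, y)
    return p[0]

p_small = weak_feature_p_value(30)
p_large = weak_feature_p_value(3000)

print(f'p-value with n=30:   {p_small:.4f}')
print(f'p-value with n=3000: {p_large:.2e}')
```

More data shrinks the p-value of a genuinely impactful feature, while a pure-noise feature stays insignificant on average.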

How to Remove Bad Features?

You just need to remove the feature's column from X_tilde. This can be done using the following code:

```python
X_tilde = X_tilde.drop(___, axis=1)
```

For example, to drop columns 'const' and 'Mother' you would use:

```python
X_tilde = X_tilde.drop(['Mother', 'const'], axis=1)
```

Then create a new OLS object using the updated X_tilde:

```python
regression_model = sm.OLS(y, X_tilde)
```

1. Which of the features should you KEEP?

2. Choose the INCORRECT statement.
