Choosing the Features
When working with many features, you often don't know which ones matter. You can train a model with all of them, check which features are unhelpful, and then retrain using only the impactful ones.
Why Remove Features from the Model?
Adding a feature that is unrelated to the target introduces noise and worsens predictions. Adding many useless features stacks up noise and degrades model quality even further.
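To see this effect, here is a minimal synthetic sketch (the data and column names below are made up for illustration): the target depends on a single useful feature, and adding several pure-noise columns typically lowers the cross-validated score.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends only on the 'useful' column
rng = np.random.default_rng(0)
X = pd.DataFrame({'useful': rng.normal(size=200)})
y = 2 * X['useful'] + rng.normal(scale=0.5, size=200)

print(cross_val_score(LinearRegression(), X, y, cv=5).mean())

# Add several unrelated (pure noise) features and score again;
# the cross-validated R^2 typically drops
for i in range(10):
    X[f'noise_{i}'] = rng.normal(size=200)

print(cross_val_score(LinearRegression(), X, y, cv=5).mean())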
How to Know if the Features Are Good or Bad?
To evaluate whether features significantly affect the target, we can compute the p-values for each feature. A low p-value suggests that the feature is statistically significant.
import pandas as pd
from sklearn.feature_selection import f_regression

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/heights_two_feature.csv'
df = pd.read_csv(file_link)
X, y = df[['Father', 'Mother']], df['Height']

# f_regression returns F-statistic and p-values
f_stat, p_values = f_regression(X, y)

# Create a DataFrame to view results nicely
results = pd.DataFrame({
    'Feature': X.columns,
    'p_value': p_values
})
print(results)
In short, the lower the p-value, the higher the confidence that the feature is impactful. Typically, a p-value less than 0.05 is considered statistically significant.
In the example above:
Father: p-value is extremely small (highly significant);
Mother: p-value is very small (highly significant).
Both features are good predictors for the target.
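Alternatively, since the rest of this section uses statsmodels OLS, where X_tilde is the feature matrix with a constant column added, you can read per-coefficient p-values directly from a fitted model. A minimal sketch, assuming X and y are defined as above:

import statsmodels.api as sm

# Add a constant column so the intercept also gets a p-value
X_tilde = sm.add_constant(X)

# Fit OLS and inspect the p-value of each coefficient
model = sm.OLS(y, X_tilde).fit()
print(model.pvalues)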
In statistics, we set a significance level, usually 0.05. If a feature's p-value exceeds this threshold, it is considered not impactful.
In practice, features with p-values slightly above 0.05 may still help the model, so it is safer to test the model both with and without such a feature. But if the p-value is very high (> 0.4), you can remove the feature confidently.
A p-value ranges from 0 to 1, so a low p-value means less than 0.05, while a high p-value usually means greater than 0.3-0.5.
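To act on the advice above about borderline features, you can fit the model both with and without the feature and compare fit quality. A minimal sketch, assuming X_tilde (with a constant) and y are already defined and 'Mother' is the borderline feature:

import statsmodels.api as sm

# Fit the full model and a reduced model without the borderline feature
full_model = sm.OLS(y, X_tilde).fit()
reduced_model = sm.OLS(y, X_tilde.drop('Mother', axis=1)).fit()

# Compare adjusted R-squared; keep the feature only if it actually helps
print('Full model adj. R^2:   ', full_model.rsquared_adj)
print('Reduced model adj. R^2:', reduced_model.rsquared_adj)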
In our example, we got a p-value of 0.087 for Mother's height and 0.051 for the constant. If we remove the features with a p-value > 0.05, we get the result below (on the left).
Even visually, we can tell that the model with the constant (on the right) is better, so it is better not to remove the constant from the model.
Small datasets often produce higher p-values (0.05β0.2) even for meaningful features. P-values reflect confidence: with more data, it becomes easier to distinguish truly impactful features from noisy ones.
How to Remove Bad Features?
You just need to remove the column corresponding to that feature from X_tilde. This can be done using the following code:
X_tilde = X_tilde.drop(___, axis=1)
For example, to drop columns 'const' and 'Mother' you would use:
X_tilde = X_tilde.drop(['Mother', 'const'], axis=1)
Then create a new OLS object using the updated X_tilde:
regression_model = sm.OLS(y, X_tilde)
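Putting it together, here is a minimal sketch of the whole step, assuming X_tilde and y are already defined and we decided to drop only the 'Mother' column while keeping the constant, as discussed above:

import statsmodels.api as sm

# Drop the unhelpful feature but keep the constant
X_tilde = X_tilde.drop('Mother', axis=1)

# Refit OLS on the reduced feature set and re-check the p-values
regression_model = sm.OLS(y, X_tilde)
print(regression_model.fit().summary())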
1. Which of the features should you KEEP?
2. Choose the INCORRECT statement.