Choosing the Features
When working with many features, you often don’t know which ones matter. You can train a model with all of them, check which features are unhelpful, and then retrain using only the impactful ones.
Why Remove Features from the Model?
Adding a feature unrelated to the target introduces noise and worsens predictions. Many useless features stack noise and reduce model quality further.
How to Know if the Features Are Good or Bad?
OLS provides statistical tests during training. Each feature gets a t-test result, shown in the summary() table, indicating whether it significantly affects the target.
import pandas as pd
import statsmodels.api as sm

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/heights_two_feature.csv'
df = pd.read_csv(file_link)  # Open the file
X, y = df[['Father', 'Mother']], df['Height']  # Assign the variables
X_tilde = sm.add_constant(X)  # Create X_tilde
regression_model = sm.OLS(y, X_tilde).fit()  # Initialize and train an OLS object
print(regression_model.summary())  # Get the summary
What we are interested in is the p-value for each feature.
In short, the lower the p-value, the higher the confidence that the feature is impactful.
In statistics, we set a significance level, usually 0.05. If a feature’s p-value exceeds this threshold, it is considered not impactful.
In practice, slightly higher p-values (just above 0.05) may still help the model. It is safer to test the model with and without such a feature. But if the p-value is very high (>0.4), you can remove it confidently.
The p-value ranges from 0 to 1. A low p-value means less than 0.05, while a high p-value usually means greater than 0.3–0.5.
In our example, the p-values for the mother's height and the constant are 0.087 and 0.051. If we remove features with a p-value > 0.05, we get the result below (on the left).
Even visually, we can tell that the model with the constant (on the right) is better, so it's better not to remove it from the model.
Small datasets often produce higher p-values (0.05–0.2) even for meaningful features. P-values reflect confidence: with more data, it becomes easier to distinguish truly impactful features from noisy ones.
How to Remove Bad Features?
You just need to remove the column related to the feature from X_tilde. This can be done using the following code:
X_tilde = X_tilde.drop(___, axis=1)
For example, to drop columns 'const' and 'Mother' you would use:
X_tilde = X_tilde.drop(['Mother', 'const'], axis=1)
Then create and train a new OLS object using the updated X_tilde:
regression_model = sm.OLS(y, X_tilde).fit()
1. Which of the features should you KEEP?
2. Choose the INCORRECT statement.