Choosing the Features

When working with many features, you often don’t know which ones matter. You can train a model with all of them, check which features are unhelpful, and then retrain using only the impactful ones.

Why Remove Features from the Model?

Adding a feature unrelated to the target introduces noise and worsens predictions. Many useless features stack noise and reduce model quality further.

How to Know if the Features Are Good or Bad?

OLS provides statistical tests during training. Each feature gets a t-test result, shown in the summary() table, indicating whether it significantly affects the target.

import pandas as pd
import statsmodels.api as sm

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/heights_two_feature.csv'
df = pd.read_csv(file_link)  # Read the file
X, y = df[['Father', 'Mother']], df['Height']  # Assign the variables
X_tilde = sm.add_constant(X)  # Create X_tilde
regression_model = sm.OLS(y, X_tilde).fit()  # Initialize and train an OLS model
print(regression_model.summary())  # Print the summary

We are interested in the p-value of each feature.

In short, the lower the p-value, the higher the confidence that the feature is impactful.

In statistics, we set a significance level, usually 0.05. If a feature’s p-value exceeds this threshold, it is considered not impactful.

In practice, slightly higher p-values (just above 0.05) may still help the model. It is safer to test the model with and without such a feature. But if the p-value is very high (>0.4), you can remove it confidently.

Note

A p-value ranges from 0 to 1, so a low p-value means less than 0.05, and a high p-value usually means greater than 0.3–0.5.

In our example, the p-values for Mother's height and the constant are 0.087 and 0.051. If we removed every feature with a p-value > 0.05, we would get the result below (on the left).

Even visually, we can tell that the model with the constant (on the right) is better, so it's better not to remove it from the model.

Note

Small datasets often produce higher p-values (0.05–0.2) even for meaningful features. P-values reflect confidence: with more data, it becomes easier to distinguish truly impactful features from noisy ones.

How to Remove Bad Features?

You just need to remove the column related to the feature from X_tilde. This can be done using the following code:

X_tilde = X_tilde.drop(___, axis=1)

For example, to drop columns 'const' and 'Mother' you would use:

X_tilde = X_tilde.drop(['Mother', 'const'], axis=1)

Then train a new OLS model on the updated X_tilde:

regression_model = sm.OLS(y, X_tilde).fit()

1. Which of the features should you KEEP?

2. Choose the INCORRECT statement.


Section 2. Chapter 4
