Linear Regression with Python
In many tasks, you may have features whose usefulness you don't know in advance. Luckily, you can train a model with all of them, see which ones have no impact, and then re-train the model using only the impactful ones.
Why would you remove any features from the model?
If you add a feature that has no relation to the target, it introduces noise into the model, making predictions worse. And when you have many useless features, the noise stacks up, degrading the model's performance further and further.
How do you know if the features are good or bad?
As you already know, the OLS class also computes statistical information during training. Among other things, it runs a t-test to determine whether each feature has a significant impact on the target. The results of the test can be found in the summary() table.
What we are interested in is the p-value for each feature. In short, the lower the p-value, the higher our confidence that the feature is impactful.
In statistics, we set a threshold for the p-value, called the significance level. It is usually set to 0.05; if a feature's p-value is greater than the significance level, we consider the feature not impactful.
In practice, however, features with p-values slightly above 0.05 often improve the model too. So it is better to try the model both with and without such a feature instead of removing it right away, unless its p-value is really high (>0.4). In that case, you can safely remove the feature.
The p-value ranges from 0 to 1, so a "low" p-value means less than 0.05, and a "high" p-value usually means greater than 0.3-0.5.
In our example, the p-values for Mother's height and the constant are 0.087 and 0.051. If we removed all features with a p-value > 0.05, we would get the result below (on the left).
Even visually, we can tell that the model with the constant (on the right) is better, so it is better not to remove it from the model.
It is common to get relatively high p-values (0.05 to 0.2) for impactful features when the dataset is small. The p-value reflects the confidence that a feature is impactful, and it is natural that the more instances we have, the easier it is to distinguish impactful features from useless ones.
Say you have data showing that 9 out of 10 tall people you know ate apples daily. Can you be sure the two are related? But what if it were 9,000 out of 10,000? That would make you much more confident.
How do you remove bad features?
You just need to remove the feature's column from X_tilde. This can be done with the DataFrame's drop() method.
For example, to drop the 'const' and 'Mother' columns, you would use:
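A minimal sketch, assuming X_tilde is a pandas DataFrame (as produced by sm.add_constant; the 'Father' column is a hypothetical extra feature added so something remains after the drop):

```python
import pandas as pd

# A toy X_tilde with a constant and two feature columns
X_tilde = pd.DataFrame({
    'const': [1.0, 1.0, 1.0],
    'Mother': [63.0, 66.0, 64.0],
    'Father': [70.0, 72.0, 69.0],
})

# drop() returns a new DataFrame; the original X_tilde is left unchanged
X_tilde_new = X_tilde.drop(columns=['const', 'Mother'])
print(X_tilde_new.columns.tolist())  # ['Father']
```

Note that drop() does not modify X_tilde in place; assign its result to a variable (or pass it straight to OLS).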
Then create a new OLS object using the updated X_tilde.
Which of the features should you KEEP?
Select a few correct answers
Choose the INCORRECT statement.
Select the correct answer