Linear Regression with Python
Choosing the Features
In many tasks, you may have many features without knowing whether they are useful. Luckily, you can train a model with all of them, see which ones are not impactful, and then re-train the model with only the impactful ones.
Why would you remove any features from the model?
If you add a feature to the model that has no relation to the target, it will create some noise in the model, making the prediction worse. And when you have many useless features, the noise will stack up, making the model perform worse and worse.
How do you know if the features are good or bad?
As you already know, while training, the OLS class also calculates statistical information. Among many other things, it runs a t-test to determine whether each feature has a significant impact on the target. The results of the test can be found in the summary() table, as shown below:
import pandas as pd
import statsmodels.api as sm

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/heights_two_feature.csv'
df = pd.read_csv(file_link)  # Open the file
X, y = df[['Father', 'Mother']], df['Height']  # Assign the variables
X_tilde = sm.add_constant(X)  # Create X_tilde
regression_model = sm.OLS(y, X_tilde).fit()  # Initialize and train an OLS object
print(regression_model.summary())  # Get the summary
What we are interested in is the p-value for each feature.
In short, the lower the p-value, the higher the confidence that the feature is impactful.
In statistics, we need to set a threshold for the p-value, called the significance level. It is usually set to 0.05, and when the p-value is greater than the significance level, we consider the feature not impactful.
However, in practice, features with a p-value slightly higher than 0.05 often improve the model too. So it is better to try the model with and without that feature instead of immediately removing it, unless it has a really high p-value (>0.4). In that case, you can safely remove the feature.
Note
A p-value ranges from 0 to 1, so a low p-value means less than 0.05, and a high p-value usually means greater than 0.3-0.5.
In our example, the p-values for Mother's height and the constant are 0.087 and 0.051, respectively. If we remove features with a p-value > 0.05, we will get the result below (on the left).
Even visually, we can tell that the model with the constant (on the right) is better, so it's better not to remove it from the model.
Note
It is common to get relatively high p-values (0.05 to 0.2) for impactful features when the dataset is small. The p-value shows the confidence that the feature is impactful, and it is natural that the more instances we have, the easier it is to distinguish impactful features from bad ones.
Say you have data showing that 9/10 of tall people you know ate apples daily. Can you be sure that this is related? But what if it were 9000/10000? That would make you more confident.
How do you remove bad features?
You just need to remove the column related to the feature from X_tilde. This can be done using the following code:
For example, to drop columns 'const' and 'Mother' you would use:
And then create a new OLS object using the updated X_tilde:
1. Which of the features should you KEEP?
2. Choose the INCORRECT statement.