Building Linear Regression Using Statsmodels

In the previous chapter, we used a NumPy function to calculate the parameters.
Now we will represent the linear regression with a class object instead of a function. This approach takes a few more lines of code to find the parameters, but the object stores a lot of helpful information and makes prediction more straightforward.

Building a Linear Regression model

In statsmodels, the OLS class can be used to create a linear regression model.

We first initialize an OLS object using sm.OLS(y, X_tilde), with the target y as the first argument, and then train it using the fit() method.

python
model = sm.OLS(y, X_tilde)
model = model.fit()

This is equivalent to:

python
model = sm.OLS(y, X_tilde).fit()

Note

The OLS constructor expects the matrix X_tilde that we saw in the Normal Equation, that is, X with a column of ones added for the intercept, rather than the raw X array. You can convert X to X_tilde using the sm.add_constant() function.

Finding parameters

When the model is trained, you can easily access the parameters using the params attribute.

import statsmodels.api as sm # import statsmodels
import pandas as pd

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/simple_height_data.csv'
df = pd.read_csv(file_link) # Open the file
X, y = df['Father'], df['Height'] # Assign the variables
# Get the correct form of input for OLS
X_tilde = sm.add_constant(X)
# Initialize an OLS object
regression_model = sm.OLS(y, X_tilde)
# Train the object
regression_model = regression_model.fit()
# Get the parameters
beta_0, beta_1 = regression_model.params
print('beta_0 is: ', beta_0)
print('beta_1 is: ', beta_1)

Making the predictions

New instances can be predicted using the predict() method, but their input must be preprocessed with sm.add_constant() as well:

import statsmodels.api as sm
import pandas as pd
import numpy as np

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/simple_height_data.csv'
df = pd.read_csv(file_link) # Open the file
X, y = df['Father'], df['Height'] # Assign the variables

X_tilde = sm.add_constant(X) # Preprocess
regression_model = sm.OLS(y, X_tilde) # Initialize an OLS object
regression_model = regression_model.fit() # Train the object

# Predict new values
X_new = np.array([65,70,75]) # Feature values of new instances
X_new_tilde = sm.add_constant(X_new) # Preprocess X_new
y_pred = regression_model.predict(X_new_tilde) # Predict the target
print(y_pred)

Getting the summary

As you probably noticed, using the OLS class is not as simple as calling the polyfit() function. However, OLS has its benefits: while training, it calculates a lot of statistical information, which you can access using the summary() method.

import statsmodels.api as sm
import pandas as pd

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/simple_height_data.csv'
df = pd.read_csv(file_link) # Read the file
X, y = df['Father'], df['Height']

X_tilde = sm.add_constant(X) # Preprocess X
regression_model = sm.OLS(y, X_tilde) # Initialize an OLS object
regression_model = regression_model.fit() # Train the object
# Print the summary
print(regression_model.summary())

That's a lot of statistics. We will discuss the table's most important parts in later sections.

