Interpolation vs Extrapolation

In the previous chapter, we noticed that the predictions of different models diverge more and more toward the edges of the data.

More precisely, the predictions become unreliable once we move past the range of values in the training set. Predicting values outside the training set's range is called extrapolation, and predicting values inside the range is called interpolation.

Regression does not handle extrapolation well. It is suited to interpolation and can yield absurd predictions when new instances fall outside the training set's range.
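To make the difference concrete, here is a minimal sketch on synthetic data (not the course dataset). It assumes a sine-shaped target sampled on [0, 1] and fits a degree-4 polynomial with NumPy's polyfit. The names x_train, y_train and abs_error are made up for this illustration.

```python
import numpy as np

# Synthetic training data (an assumption for illustration): a noisy sine on [0, 1]
rng = np.random.default_rng(42)
x_train = np.linspace(0.0, 1.0, 40)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.05, size=x_train.size)

# Fit a degree-4 polynomial by least squares
coeffs = np.polyfit(x_train, y_train, deg=4)

def abs_error(x):
    """Absolute difference between the fitted polynomial and the true function at x."""
    return abs(np.polyval(coeffs, x) - np.sin(2 * np.pi * x))

interp_error = abs_error(0.5)  # x = 0.5 lies inside the training range (interpolation)
extrap_error = abs_error(2.0)  # x = 2.0 lies outside the training range (extrapolation)

print(f"interpolation error at x=0.5: {interp_error:.3f}")
print(f"extrapolation error at x=2.0: {extrap_error:.3f}")
```

Inside the training range the polynomial follows the target closely. Outside it, the highest-degree terms dominate and the prediction drifts far from the true value.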

Confidence Intervals

Using the OLS class, you can also get the confidence intervals for the regression line at any point. But the syntax is a bit complicated:

lower = regression_model.get_prediction(X_new_tilde).summary_frame(alpha)['mean_ci_lower']
upper = regression_model.get_prediction(X_new_tilde).summary_frame(alpha)['mean_ci_upper']

Here alpha is the significance level, usually set to 0.05, which corresponds to a confidence level of 1 - alpha = 95%.
Using the above code, you will get the lower and upper bounds of the regression line's confidence interval at the point X_new_tilde (or arrays of lower and upper bounds if X_new_tilde contains several points).

Given this, we can now plot the regression line along with its confidence interval:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures # Import PolynomialFeatures class

file_link = 'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/poly.csv'
df = pd.read_csv(file_link)
n = 4   # A degree of the polynomial regression
X = df[['Feature']] # Assign X as a DataFrame
y = df['Target'] # Assign y
X_tilde = PolynomialFeatures(n).fit_transform(X) # Get X_tilde
regression_model = sm.OLS(y, X_tilde).fit() # Initialize and train the model
X_new = np.linspace(-0.1, 1.5, 80) # 1-d array of new feature values
X_new_tilde = PolynomialFeatures(n).fit_transform(X_new.reshape(-1,1)) # Transform X_new for predict() method
y_pred = regression_model.predict(X_new_tilde)
ci = regression_model.get_prediction(X_new_tilde).summary_frame(alpha=0.05) # Compute the 95% confidence intervals once
lower = ci['mean_ci_lower'] # Get lower bound for each point
upper = ci['mean_ci_upper'] # Get upper bound for each point
plt.scatter(X, y) # Build a scatterplot
plt.plot(X_new, y_pred) # Build a Polynomial Regression graph
plt.fill_between(X_new, lower, upper, alpha=0.4) # Shade the confidence band (this alpha is the fill transparency)
plt.show()

Without knowing the distribution of the target, we can't find the exact regression line. All we can do is approximate it from our data. The confidence interval of the regression line is the interval that contains the exact regression line with confidence level 1 - alpha (95% for alpha = 0.05).
You can see that the interval grows wider and wider the further it gets from the training set's range.
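One way to see why the interval widens is the standard OLS result that the half-width of the mean confidence interval at a point x0 is proportional to sqrt(x0ᵀ (XᵀX)⁻¹ x0), a quantity often called the leverage of x0. The sketch below computes the leverage with NumPy only. It assumes training features evenly spaced on [0, 1] and a degree-4 polynomial, and the names design and leverage are hypothetical helpers, not part of any library.

```python
import numpy as np

# Assumed training features (for illustration): 40 points evenly spaced on [0, 1]
x_train = np.linspace(0.0, 1.0, 40)
degree = 4

def design(x):
    """Polynomial design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

X_tilde = design(x_train)
XtX_inv = np.linalg.inv(X_tilde.T @ X_tilde)

def leverage(x0):
    """x0^T (X^T X)^-1 x0 for a single feature value x0."""
    row = design(x0)
    return (row @ XtX_inv @ row.T).item()

inside = leverage(0.5)   # inside the training range
outside = leverage(1.5)  # outside the training range

print(f"leverage at x=0.5: {inside:.3f}")
print(f"leverage at x=1.5: {outside:.3f}")
```

The leverage depends only on the feature values, not on the target. It stays small inside the training range and grows quickly outside it, which is what makes the band fan out.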

Note

The confidence intervals are built assuming we correctly chose the model (e.g., Simple Linear Regression or Polynomial Regression of degree 4).

If the model is chosen poorly, the confidence interval is unreliable, and so is the line itself. You will learn how to select the best model in the following section.


Section 3. Chapter 4
