Learn Predict House Prices | Simple Linear Regression
Linear Regression with Python

Predict House Prices

Let's build a regression model on a real-world example. The file houses_simple.csv holds housing prices, with each house's area (the square_feet column) as the only feature.

import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/houses_simple.csv')
print(df.head())

Let's assign the feature and the target to variables and visualize the dataset:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/houses_simple.csv')
X = df['square_feet']
y = df['price']

plt.scatter(X, y, alpha=0.5)
plt.show()

In the example with a person's height, it was much easier to imagine a line fitting the data well.
Here the data has much more variance, since the target depends heavily on many other things, such as age, location, and interior.
Still, the task is to build the line that best fits the data we have; it will show the trend. Use the OLS class for that. Soon we will learn how to add more features, which will make the predictions better!
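
To make "the line that best fits the data" concrete, here is a minimal sketch of what an ordinary least squares fit computes. It uses synthetic data rather than houses_simple.csv, and the intercept, slope, and noise level are made-up values chosen only for illustration. NumPy's lstsq stands in for the OLS class here, so treat it as an illustration of the idea rather than the lesson's method.

```python
import numpy as np

# Synthetic stand-in for the housing data (all numbers are illustrative assumptions)
rng = np.random.default_rng(0)
area = rng.uniform(500, 5000, size=200)        # hypothetical areas
true_intercept, true_slope = 50_000.0, 120.0   # hypothetical underlying trend
price = true_intercept + true_slope * area + rng.normal(0, 40_000, size=200)

# Design matrix: a column of ones (for the intercept) next to the feature
X_tilde = np.column_stack([np.ones_like(area), area])

# Least squares: find the coefficients that minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_tilde, price, rcond=None)
intercept, slope = beta

print(f"intercept = {intercept:.1f}, slope = {slope:.2f}")
```

Even though individual points are scattered far from the line, the fitted slope should land close to the trend used to generate the data, which is the sense in which the line "shows the trend".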

Task

  1. Assign the 'price' column of df to y.
  2. Create the X_tilde matrix using the add_constant() function from statsmodels (imported as sm).
  3. Initialize the OLS object and train it.
  4. Preprocess the X_new array the same way as X.
  5. Predict the target for the X_new_tilde matrix.

Solution

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/houses_simple.csv')
# Assign the variables
X = df['square_feet']
y = df['price']
# Preprocess X
X_tilde = sm.add_constant(X)
# Build and train a model
regression_model = sm.OLS(y, X_tilde).fit()
# Create and preprocess X_new
X_new = np.array([1300, 10000, 25000])
X_new_tilde = sm.add_constant(X_new)
# Predict the target for X_new
y_pred = regression_model.predict(X_new_tilde)
print(y_pred)
# Plot the data points and prediction line
plt.scatter(X, y, alpha=0.5)
plt.plot(X_new, y_pred)
plt.show()
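
The preprocessing of X_new and the final predict() call can be reproduced by hand, which shows why X_new needs the same column of ones as X. This is a sketch under assumptions: the coefficient values are made up rather than taken from the model fitted on the real data, and NumPy's column_stack is used to mimic what add_constant() does for an array like this one.

```python
import numpy as np

X_new = np.array([1300, 10000, 25000])

# Mimic add_constant(): put a column of ones in front of the feature column
X_new_tilde = np.column_stack([np.ones(len(X_new)), X_new])
print(X_new_tilde)

# Hypothetical coefficients [intercept, slope]; not the values fitted on the real data
params = np.array([50_000.0, 120.0])

# For a linear model, predicting is a matrix product: intercept * 1 + slope * area
y_pred = X_new_tilde @ params
print(y_pred)
```

Without the column of ones, the matrix would have a single column while the model has two coefficients, so the product would not be defined.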
Section 1. Chapter 5
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/houses_simple.csv')
# Assign the variables
X = df['square_feet']
y = df['___']
# Preprocess X
X_tilde = ___.___(X)
# Build and train a model
regression_model = sm.___(y, ___).fit()
# Create and preprocess X_new
X_new = np.array([1300, 10000, 25000])
X_new_tilde = ___.___(X_new)
# Predict the target for X_new
y_pred = regression_model.___(X_new_tilde)
print(y_pred)
# Plot the data points and prediction line
plt.scatter(X, y, alpha=0.5)
plt.plot(X_new, y_pred)
plt.show()