You will now build a regression model on a real-world example. The file `houses_simple.csv` holds housing prices, with each house's area as the feature.

import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/houses_simple.csv')
print(df.head())

The next step is to assign variables and visualize the dataset:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b22d1166-efda-45e8-979e-6c3ecfc566fc/houses_simple.csv')
X = df['square_feet']
y = df['price']

plt.scatter(X, y, alpha=0.5)
plt.show()

In the example with a person's height, it was much easier to imagine a line fitting the data well.  

But now the data has much more variance, since the target depends heavily on many other things, such as age, location, and interior.  
Still, the task is to build the line that best fits the data we have; it will show the trend. Use the `OLS` class for that. Soon you will learn how to add more features, which will make the predictions better!

import unittest
import importlib
import numpy as np
import pandas as pd
import statsmodels.api as sm


# Helper for dynamic test names
def _dynamic_test(test_case, condition, ok_msg, fail_msg):
    if condition:
        test_case._testMethodName = ok_msg
        test_case.assertTrue(True)
    else:
        test_case._testMethodName = fail_msg
        test_case.fail(fail_msg)


class TestUserCode(unittest.TestCase):

    def test_y_is_price(self):
        import user_code

        condition = (
            hasattr(user_code, "y") and
            isinstance(user_code.y, pd.Series) and
            user_code.y.name == "price"
        )

        _dynamic_test(
            self,
            condition,
            "The `y` variable correctly contains the `price` column.",
            "Expected `y` to be assigned as df['price']."
        )

    def test_X_tilde_is_add_constant(self):
        import user_code

        condition = (
            hasattr(user_code, "X_tilde") and
            isinstance(user_code.X_tilde, pd.DataFrame) and
            "const" in user_code.X_tilde.columns
        )

        _dynamic_test(
            self,
            condition,
            "The `X_tilde` matrix is created using sm.add_constant.",
            "Expected `X_tilde` to contain a constant column (using sm.add_constant)."
        )

    def test_regression_model_is_ols(self):
        import user_code
        from statsmodels.regression.linear_model import RegressionResultsWrapper

        condition = (
            hasattr(user_code, "regression_model") and
            isinstance(user_code.regression_model, RegressionResultsWrapper)
        )

        _dynamic_test(
            self,
            condition,
            "The model is an instance of OLS and is fitted.",
            "Expected `regression_model` to be a fitted OLS model."
        )

    def test_X_new_tilde_correct(self):
        import user_code

        condition = (
            hasattr(user_code, "X_new_tilde") and
            hasattr(user_code, "X_new") and
            isinstance(user_code.X_new_tilde, np.ndarray) and
            user_code.X_new_tilde.shape == (3, 2)  # 3 samples, 2 columns (constant + feature)
        )

        _dynamic_test(
            self,
            condition,
            "The `X_new_tilde` matrix is correctly created with a constant column.",
            "Expected `X_new_tilde` to be a 2-column array created using sm.add_constant."
        )

    def test_y_pred_is_correct_shape(self):
        import user_code
        import numpy as np

        condition = (
            hasattr(user_code, "y_pred") and
            isinstance(user_code.y_pred, np.ndarray) and
            user_code.y_pred.shape == (3,)
        )

        _dynamic_test(
            self,
            condition,
            "The `y_pred` array has the correct shape (3,).",
            "Expected `y_pred` to be a NumPy array with shape (3,)."
        )

    def test_predict_called(self):
        """
        Checks that the predictions are numbers and reasonable (not NaN or None).
        """
        import user_code
        import numpy as np

        try:
            preds = user_code.y_pred
            condition = (
                isinstance(preds, np.ndarray) and
                np.all(~np.isnan(preds)) and
                preds.size == 3
            )
        except Exception:
            condition = False

        _dynamic_test(
            self,
            condition,
            "The predictions are valid numeric outputs.",
            "Expected `y_pred` to contain valid numeric values."
        )


if __name__ == "__main__":
    unittest.main()


test_code.py

Linear Regression is a crucial concept in predictive analytics. It is widely used by data scientists, data analysts, and statisticians because it is easy to build and interpret, yet powerful enough for many tasks.

Let's start with the simplest Linear Regression model! You will learn the idea behind Linear Regression and how to make predictions in Python.

Most real-world prediction tasks involve more than one feature. You will learn how to handle Linear Regression with multiple features.

A straight line does not always describe the data well. Let's learn how to build a more complex model for prediction! That is what Polynomial Regression is for.

Now that you know how to build many Linear Regression models, you need a way to choose the best one. This is achievable using metrics. This section explains the most used ones and the difficulties you can face using them.

Challenge: Predicting House Prices
