[Python] Running a predictive regression model on the California housing dataset

In this post, I will show how you can run a predictive regression model on a given dataset. A regression problem differs from a classification problem. In classification, you predict a discrete class label (or class probability) for each sample in the data. In regression, you predict a continuous numeric value (which may happen to be integer-valued) for each sample. Common regression problems include forecasting house prices, estimating salaries, and predicting future stock prices.

To evaluate the performance of a regression model, you cannot use classification metrics such as accuracy, AUC, or F1. Regression models are instead evaluated with metrics such as root mean squared error (RMSE), the R2 score, or mean absolute error (MAE).
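To make these metrics concrete, here is a minimal sketch that computes RMSE, MAE, and R2 by hand in plain Python. The function name regression_metrics and the toy numbers are made up for illustration; in practice you would use the equivalent functions in sklearn.metrics.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # MAE: mean of the absolute errors
    mae = sum(abs(e) for e in errors) / n

    # R2: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "r2": r2}


# toy values, made up for illustration
scores = regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0])
print(scores)
```

Lower RMSE and MAE are better, while an R2 closer to 1 is better. RMSE and MAE are expressed in the same units as the target.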

About the California housing dataset

I use the California housing dataset available in the scikit-learn library in this post. The dataset contains 20,640 instances with 8 numeric attributes and a numeric target. There are no missing or null values in the dataset. The target variable is the median house value for California districts, expressed in units of $100,000. The 8 attributes are as follows:

  • MedInc: median income in block group
  • HouseAge: median house age in block group
  • AveRooms: average number of rooms per household
  • AveBedrms: average number of bedrooms per household
  • Population: block group population
  • AveOccup: average number of household members
  • Latitude: block group latitude
  • Longitude: block group longitude

To fetch the data, you can use the following code:

from sklearn.datasets import fetch_california_housing

hdata = fetch_california_housing()
X, y, headers = hdata['data'], hdata['target'], hdata['feature_names']

The returned value ‘hdata’ is a dictionary-like Bunch object that contains ‘data’, ‘target’, ‘feature_names’ and some other details. The first and last few rows of the features and targets look like this:

'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]]), 
'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])
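Before modeling, it can help to confirm the shape of the data, check for missing values, and translate the target range into plain dollars. The following is a minimal sketch; summarize_dataset is a hypothetical helper (not part of scikit-learn), and the small demo arrays are made up so that the example runs without downloading anything.

```python
import numpy as np


def summarize_dataset(X, y):
    """Report basic facts about a feature matrix X and target vector y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        # count NaN entries in both features and targets
        "n_missing": int(np.isnan(X).sum() + np.isnan(y).sum()),
        # targets are in units of $100,000, so scale to plain dollars
        "target_min_usd": float(y.min() * 100_000),
        "target_max_usd": float(y.max() * 100_000),
    }


# tiny made-up arrays, only to show the call
X_demo = np.array([[8.3, 41.0], [7.2, 52.0], [1.7, 17.0]])
y_demo = np.array([4.526, 3.521, 0.923])
print(summarize_dataset(X_demo, y_demo))
```

Calling summarize_dataset(X, y) on the fetched housing arrays should report 20,640 samples, 8 features, and no missing values, if the dataset matches the description above.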

Regression model and Python code

I am using the CatBoost regression model in the code because it performed better than the other algorithms I tried. Also, CatBoost chooses some of its default parameters (the learning rate, for example) based on properties of the data, such as its size. Most of the time, you can get reasonable results without running a grid search to tune the parameters.
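A claim that one algorithm outperforms others can be checked by scoring several models under the same cross-validation setup. Here is a minimal sketch of that idea. The helper compare_models is hypothetical, two scikit-learn regressors are used as stand-ins, and the data is synthetic so that the example runs offline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def compare_models(models, X, y, cv=5):
    """Return the mean cross-validated R2 score for each named model."""
    results = {}
    for name, model in models.items():
        # scoring="r2" makes the metric explicit; it is also the regressor default
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        results[name] = scores.mean()
    return results


# synthetic data so the sketch runs without downloading anything
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = compare_models(candidates, X_demo, y_demo)
for name, score in results.items():
    print(f"{name}: mean R2 = {score:.3f}")
```

To include CatBoost in the comparison, you could add an entry such as "catboost": CatBoostRegressor(verbose=False) to the dictionary and pass the housing X and y instead of the synthetic data.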

Here is the complete code to fetch the data, run a regression model with 5-fold cross-validation, and evaluate the model’s performance using R2, RMSE, and the median absolute error.

from sklearn.datasets import fetch_california_housing
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score


def main():
    """
    Fetch the data, run CatBoost with 5-fold CV, and print the metrics.
    """
    # fetch housing data and labels
    hdata = fetch_california_housing()
    X, y, headers = hdata['data'], hdata['target'], hdata['feature_names']
    # print(headers)
    # print(X)
    # print(y)

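    # run regression model with 5-fold CV (cv=5 is also the default)
    clf = CatBoostRegressor(verbose=False)
    preds = cross_val_predict(clf, X, y, cv=5, method="predict")

    # calculate prediction performance on the out-of-fold predictions
    print("R2 score:", r2_score(y, preds))
    # take the square root of the MSE; the squared=False shortcut is
    # deprecated in recent scikit-learn releases
    print("Root Mean Square Error:", mean_squared_error(y, preds) ** 0.5)
    # median_absolute_error returns the median, not the mean, absolute error
    print("Median Absolute Error:", median_absolute_error(y, preds))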


if __name__ == "__main__":
    main()

Output of the code:

The above code prints the following output:

R2 score: 0.722705761583593
Root Mean Square Error: 0.6076439964109932
Median Absolute Error: 0.3142923456700748
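When cv is given as an integer (or left at its default), scikit-learn splits regression data into consecutive folds without shuffling, so cross-validated scores can depend on the order of the rows. Here is a sketch of a shuffled alternative that reports several metrics at once. The helper shuffled_cv_report is hypothetical, and a linear model on synthetic data stands in for CatBoost on the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def shuffled_cv_report(model, X, y, n_splits=5, seed=1234):
    """Run shuffled K-fold CV and return the mean of each metric."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scoring = {
        "r2": "r2",
        "rmse": "neg_root_mean_squared_error",
        "median_ae": "neg_median_absolute_error",
    }
    out = cross_validate(model, X, y, cv=cv, scoring=scoring)
    return {
        "r2": out["test_r2"].mean(),
        # error scorers are negated by scikit-learn, so flip the sign back
        "rmse": -out["test_rmse"].mean(),
        "median_ae": -out["test_median_ae"].mean(),
    }


X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
report = shuffled_cv_report(LinearRegression(), X_demo, y_demo)
print(report)
```

Note that this averages each metric over the folds, which is not exactly the same as computing the metric once on the pooled out-of-fold predictions, so the numbers are not directly comparable with those from cross_val_predict.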

To run a predictive regression model on your own dataset, extract the features (X) and targets (y) from the data and apply a regression algorithm to them. If you have separate train and test sets, you can train the model on the train set with CatBoost's fit() function and evaluate the fitted model on the test set with its predict() function.

Here is an example that shows how to use the fit() and predict() functions:

from sklearn.datasets import fetch_california_housing
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def main():
    """
    Train CatBoost on a train split and print the test-set metrics.
    """
    # fetch housing data and labels
    hdata = fetch_california_housing()
    X, y, headers = hdata['data'], hdata['target'], hdata['feature_names']

    # split data into train (60%) and test (40%) sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=1234)

    # train the model
    clf = CatBoostRegressor(verbose=False)
    clf.fit(X_train, y_train)

    # test the model
    preds = clf.predict(X_test)

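    # calculate prediction performance on the held-out test set
    print("R2 score:", r2_score(y_test, preds))
    # take the square root of the MSE; the squared=False shortcut is
    # deprecated in recent scikit-learn releases
    print("Root Mean Square Error:", mean_squared_error(y_test, preds) ** 0.5)
    # median_absolute_error returns the median, not the mean, absolute error
    print("Median Absolute Error:", median_absolute_error(y_test, preds))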


if __name__ == "__main__":
    main()
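The feature names are fetched in the code above but never used. One use for them is to label per-feature importance scores from a fitted model. Here is a small sketch; rank_features is a hypothetical helper, and the importance values are made up only to show the shape of the output.

```python
def rank_features(importances, names):
    """Pair each feature name with its importance, highest first."""
    if len(importances) != len(names):
        raise ValueError("importances and names must have the same length")
    pairs = zip(names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# made-up importance values, only to show the output shape
demo = rank_features([10.0, 55.5, 34.5], ["HouseAge", "MedInc", "Latitude"])
for name, value in demo:
    print(f"{name}: {value:.1f}")
```

After fitting, a CatBoost model's get_feature_importance() method returns one score per feature, so passing its result together with the headers list to rank_features should give a labeled ranking for the housing data.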

Please let me know if any part of the code is unclear.
