[Python] Running CatBoost classifier on data with categorical (non-numerical) features

CatBoost (Categorical Boosting) is an open-source gradient boosting algorithm on decision trees. Developed by Yandex, it has become a strong competitor to XGBoost in Kaggle competitions. Yandex uses this algorithm extensively for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks.

CatBoost is one of the few machine learning algorithms that accept non-numeric features directly, so you do not have to spend time and effort encoding them as numbers first. The algorithm is also claimed to give excellent results with default parameters, which can save the time otherwise spent on parameter tuning.

Since CatBoost works on non-numerical data, this post shows how to use categorical data to train and test a CatBoost model. I used the Mushroom data available at the UCI repository to train the model. CatBoostClassifier comes with a large number of parameters. I ran the model both with default parameters and with some parameters set explicitly, and got better classification results with the defaults. Both versions are present in the code below; you can try them by commenting/uncommenting the lines that call "train_and_test_catboost()".

How to train the model with categorical data

The fit() function of CatBoostClassifier has a parameter "cat_features". You need to pass it a list of the indices of the non-numeric features. For example, in the mushroom data all 22 features are non-numerical, so I can provide cat_features=[0,1,2,…,21]. Some datasets have both numerical and non-numerical features. CatBoostClassifier works on those datasets too; in that case, list only the indices of the non-numerical columns.
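For a dataset that mixes numerical and non-numerical columns, the index list can be derived from the column dtypes instead of being typed by hand. The following is a minimal sketch, assuming the data is held in a pandas DataFrame; the helper name categorical_feature_indices and the toy columns are hypothetical and are not part of CatBoost or of the Mushroom data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def categorical_feature_indices(df):
    """Return the positional indices of the non-numeric columns of df."""
    return [i for i, dtype in enumerate(df.dtypes) if not is_numeric_dtype(dtype)]


# Hypothetical toy data with two categorical columns and one numerical column
toy = pd.DataFrame({
    "cap_shape": ["x", "b", "x", "f"],
    "stalk_height": [5.1, 3.4, 4.8, 6.0],
    "odor": ["n", "a", "n", "p"],
})

cat_idx = categorical_feature_indices(toy)
print(cat_idx)  # [0, 2]
```

The resulting list is what would be passed as cat_features in the fit() call, for example model.fit(X, y, cat_features=cat_idx).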

I am running 5-fold cross-validation in the following code to train/test the CatBoost model. The classification results (AUC-ROC, Accuracy) are computed once, from the predicted labels and probabilities pooled across the test folds of the 5-fold CV.
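The pooling step can also be written by filling pre-allocated arrays at the test indices of each fold, which keeps the predictions in the original row order. This is only a sketch of that idea: it uses synthetic data from scikit-learn and a LogisticRegression as a stand-in classifier, not CatBoost and not the Mushroom data, and the names oof_pred, oof_prob and seen are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data (assumption: stands in for real data)
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
oof_pred = np.zeros(len(y), dtype=int)   # pooled predicted labels
oof_prob = np.zeros(len(y))              # pooled probabilities of class 1
seen = np.zeros(len(y), dtype=int)       # how many times each row was tested

for tr_idx, te_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)  # stand-in for CatBoostClassifier
    clf.fit(X[tr_idx], y[tr_idx])
    oof_pred[te_idx] = clf.predict(X[te_idx])
    oof_prob[te_idx] = clf.predict_proba(X[te_idx])[:, 1]
    seen[te_idx] += 1

# Each row appears in exactly one test fold, so the metrics cover all rows once
auc = roc_auc_score(y, oof_prob)
acc = accuracy_score(y, oof_pred)
print("AUC-ROC:", auc)
print("Accuracy:", acc)
```

Because the arrays are indexed by te_idx, the metrics can be computed directly against the original label vector, with no need to build a separate list of true labels.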

Find the complete Python code below, and let me know if you find any errors in it:

Python Code for training and testing the CatBoost Classifier

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

def fetch_data(ifile):
    """Read the data file and generate ML data and labels"""
    df = pd.read_csv(ifile, sep=",", header=None)
    y = df.iloc[:, 0].to_numpy()
    y = np.asarray([1 if v == 'p' else 0 for v in y])
    df = df.iloc[:, 1:]
    X = df.to_numpy()
    return X, y

def train_and_test_catboost(data, label, params=None, n_folds=5, rseed=7):
    """Run k-fold CV using catboost classifier"""
    # instantiate the model
    if params is None:
        model = CatBoostClassifier(random_seed=rseed, verbose=False)
    else:
        model = CatBoostClassifier(**params)

    # generate train/test sets for k-fold and run models
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=rseed)
    cat_features_idx = list(range(data.shape[1]))  # indices of categorical features (all columns)
    k = 1
    y_true = []
    y_pred = []
    y_prob = []

    for tr_idx, te_idx in skf.split(data, label):
        print("Processing fold: {0}".format(k))
        model.fit(data[tr_idx], label[tr_idx], cat_features=cat_features_idx)
        if k == 1:
            # first fold: start the pooled arrays
            y_true = label[te_idx]
            y_pred = model.predict(data[te_idx])
            y_prob = model.predict_proba(data[te_idx])
        else:
            # later folds: append to the pooled arrays
            y_true = np.concatenate([y_true, label[te_idx]])
            y_pred = np.concatenate([y_pred, model.predict(data[te_idx])])
            y_prob = np.concatenate([y_prob, model.predict_proba(data[te_idx])], axis=0)
        k += 1

    return y_true, y_pred, y_prob

def classification_results(y_true=None, y_pred=None, probs=None):
    """Compute classification performance"""
    print("AUC-ROC: ", roc_auc_score(y_true, probs[:, 1]))
    print("Accuracy: ", accuracy_score(y_true, y_pred))

if __name__ == "__main__":
    inp_file = "data/mushroom.data"    # Mushroom data file

    # train and test the catboost model
    params = {'iterations': 100,
              'depth': 6,
              'learning_rate': 0.20,
              'loss_function': 'Logloss',
              'random_seed': 7,
              'verbose': False}

    # get data from the file
    xdata, ydata = fetch_data(inp_file)

    # Train and test the model - with params and without params
    # tru_y, pred_y, pred_probs = train_and_test_catboost(xdata, ydata, params=params, n_folds=5, rseed=7)
    tru_y, pred_y, pred_probs = train_and_test_catboost(xdata, ydata, n_folds=5, rseed=7)

    # show classification results
    classification_results(y_true=tru_y, y_pred=pred_y, probs=pred_probs)

The above code gave the following classification results: (The code was run on Ubuntu 20 with Python 3.8.10)

AUC-ROC: 1.0
Accuracy: 1.0
