[Python] XGBoost parameter optimization (tuning) using GridSearchCV for imbalanced data

XGBoost is an optimized gradient boosting library that provides parallel tree boosting (also known as GBDT or GBM). It is highly efficient and one of the most popular machine learning algorithms in Kaggle competitions. Because the algorithm is well optimized, its default parameters often give good results. However, like other machine learning algorithms, its performance suffers when the data are imbalanced or noisy (label or attribute noise).

Class imbalance is a common problem in real-world data, and applying XGBoost with default parameters to imbalanced data may result in poor classification performance on the minority class. So, it is usually worth running a grid search to tune the XGBoost parameters.

Here, I am not going to discuss all parameters of XGBoost. You can find the complete list of parameters in the XGBoost documentation. This post will show how to use sklearn’s GridSearchCV() to find the optimal parameters. It performs an exhaustive search over the specified parameter values for an estimator, fitting and cross-validating the model for every combination. Remember, GridSearchCV() can take hours or even days to search several parameters over a wide range, depending on the system and the size of the data. So, go through the list of parameters carefully and decide which parameters you want to tune.
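
To get a feel for the cost before launching a search, it can help to count the model fits involved. The sketch below is only a rough illustration, not scikit-learn's own bookkeeping: it enumerates a grid with itertools.product and multiplies by the number of folds. The helper name count_fits and the example grid are assumptions made for illustration.

```python
from itertools import product


def count_fits(param_grid, cv):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    # An exhaustive grid search tries every combination of the listed values.
    combinations = list(product(*param_grid.values()))
    # Each combination is fitted once per fold (a final refit on all data is extra).
    return len(combinations), len(combinations) * cv


# Hypothetical grid: 3 depths x 4 class weights, evaluated with 5-fold CV.
grid = {"max_depth": [3, 4, 5], "scale_pos_weight": [1, 2, 3, 4]}
n_combinations, n_fits = count_fits(grid, cv=5)
print(n_combinations, n_fits)  # 12 60
```

Adding a third parameter with five candidate values to this grid would raise the count to 60 combinations and 300 fits, which is why the grid grows expensive so quickly.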

If the dataset is imbalanced, XGBoost tends to be biased towards the majority class. If the positive class is the minority and the negative class is the majority, recall (true positives / all actual positives) will be small, because many positive examples will be classified as negative. So, if you want the ML algorithm to have better “recall,” you can use “recall” as the “scoring” parameter in GridSearchCV(). The search will then use recall, measured on the held-out fold of each cross-validation split, as the metric for ranking parameter combinations. The “scoring” parameter of GridSearchCV() must be chosen wisely to get the best estimator for the given data.
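
To make the point concrete, here is a minimal pure-Python sketch with made-up toy labels (not the data used in this post). It assumes the positive class is labeled 1 and that at least one positive example exists. A classifier that leans towards the majority class can still look good on accuracy while missing half of the positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """True positives divided by all actual positives (TP / (TP + FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


# Toy example: 8 negatives and 2 positives; one positive is predicted as negative.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

print(accuracy(y_true, y_pred))  # 0.9
print(recall(y_true, y_pred))    # 0.5
```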

I generated synthetic data with 20,000 samples (80% class 0 and 20% class 1). Although the data are not highly imbalanced, I used “recall” as scoring. I am optimizing only two XGBoost parameters: max_depth and scale_pos_weight. If you want, you can add more parameters; look into the parameters list and select parameter(s) for tuning.
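
The candidate values for scale_pos_weight in this post are derived from the class ratio. A commonly suggested starting point for this parameter is the number of negative examples divided by the number of positive examples. The sketch below shows that arithmetic on a stand-in label list, assuming an exact 80/20 split of 20,000 samples. The helper name scale_pos_weight_candidates is hypothetical.

```python
from collections import Counter


def scale_pos_weight_candidates(labels, positive=1, negative=0):
    """Return the negative/positive ratio and the integer grid 1..round(ratio)."""
    counts = Counter(labels)
    ratio = counts[negative] / counts[positive]
    # Try every integer weight from 1 (no reweighting) up to the rounded ratio.
    return ratio, list(range(1, round(ratio) + 1))


# Stand-in labels: 16,000 negatives and 4,000 positives (assumed exact split).
labels = [0] * 16000 + [1] * 4000
ratio, candidates = scale_pos_weight_candidates(labels)
print(ratio, candidates)  # 4.0 [1, 2, 3, 4]
```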

Here is the complete code for XGBoost parameter tuning:

from sklearn.datasets import make_classification
from collections import Counter
import numpy as np
import xgboost as xgb
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV


def generate_synthetic_data(n_samples=1000, n_features=25, n_classes=2, n_clusters_per_class=1, weights=None,
                            flip_y=0, class_sep=0.3):
    """
    Generate synthetic data using sklearn make_classification
    """
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, weights=weights,
                               n_clusters_per_class=n_clusters_per_class, flip_y=flip_y, class_sep=class_sep,
                               random_state=7)
    return X, y


def find_optimal_parameters(X, y, scoring='recall'):
    """
    Use GridSearchCV to find optimal parameters for the XGBoost model.
    Only 2 parameters are tuned here; you can add more parameters and provide
    the range of values you want to check.
    Since my dataset is imbalanced, scoring defaults to 'recall';
    pass scoring='roc_auc' to reproduce the second set of results.
    """
    tot = Counter(y)
    print("Ratio of majority class to minority class: ", tot[0] / tot[1])
    # try scale_pos_weight from 1 up to the rounded majority/minority ratio
    parameters = {'max_depth': [3, 4, 5],
                  'scale_pos_weight': list(range(1, round(tot[0] / tot[1]) + 1))}

    # use_label_encoder is accepted by the XGBoost release used for this post;
    # newer releases may warn that it is unused
    clf = xgb.XGBClassifier(eval_metric='logloss', use_label_encoder=False)
    model = GridSearchCV(clf, parameters, scoring=scoring, cv=5, n_jobs=4, verbose=7)
    model.fit(X, y)
    # note: the metrics below are computed on the same data used for fitting
    y_preds = model.predict(X)
    y_probs = model.predict_proba(X)
    print("AUC-ROC: ", roc_auc_score(y, y_probs[:, 1]))
    print("tn, fp, fn, tp: ", confusion_matrix(y, y_preds).ravel())
    print(model.best_params_)
    print(model.best_estimator_)


if __name__ == "__main__":
    # generate synthetic data
    data, labels = generate_synthetic_data(n_samples=20000, n_features=10, n_classes=2, weights=[0.80, 0.20])
    print(Counter(labels))
    find_optimal_parameters(data, labels)

When scoring='recall', the code gave the following results:

AUC-ROC:  0.955228640625
tn, fp, fn, tp:  [14590  1410   598  3402]
{'max_depth': 3, 'scale_pos_weight': 4}
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=4, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

When scoring='roc_auc', the code gave the following results:

AUC-ROC:  0.955563984375
tn, fp, fn, tp:  [15242   758   833  3167]
{'max_depth': 3, 'scale_pos_weight': 2}
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=2, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

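From the above results, you can see that when I used “recall” as the scoring parameter in GridSearchCV(), the selected estimator found more true positives (3,402 versus 3,167), at the cost of more false positives (1,410 versus 758). The AUC values are comparable. Keep in mind that these numbers are computed on the same data used for fitting, so they are optimistic. So, in my opinion, when you are dealing with the class imbalance problem and care most about catching the minority class, try “recall” as the scoring parameter.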
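
To compare the two runs on the same footing, the reported confusion-matrix counts can be turned into recall and precision. This is a small sketch using only the numbers printed above. The helper name recall_and_precision is an assumption for illustration.

```python
def recall_and_precision(tn, fp, fn, tp):
    """Compute recall (TP / (TP + FN)) and precision (TP / (TP + FP))."""
    return tp / (tp + fn), tp / (tp + fp)


# Counts reported above, in the order tn, fp, fn, tp.
recall_run = recall_and_precision(14590, 1410, 598, 3402)   # scoring='recall'
roc_auc_run = recall_and_precision(15242, 758, 833, 3167)   # scoring='roc_auc'

print("scoring='recall':  recall=%.2f precision=%.2f" % recall_run)
print("scoring='roc_auc': recall=%.2f precision=%.2f" % roc_auc_run)
```

Recall is about 0.85 with scoring='recall' versus about 0.79 with scoring='roc_auc', while precision moves the other way (about 0.71 versus about 0.81). The gain in recall is therefore a trade-off, not a free improvement.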

