[Python] Probability calibration for imbalanced data using Isotonic method

In many real-world classification problems, estimating class probabilities matters more for decision-making than predicting class labels. For example, in determining protein-disease association (how likely a protein is to be associated with a given disease), the class probabilities provide more meaningful information than the class label alone. The class probability offers a measure of confidence in a model’s binary prediction.

Many classification algorithms can return class probabilities as well as class labels. However, they often fail to return calibrated probabilities, especially when the dataset is imbalanced. Class imbalance, where one class has far more instances than the other, is prevalent in real-world data. Most traditional classification algorithms work well only on balanced, cleanly labeled data and tend to be biased towards the majority class when the classes are imbalanced. Applying them to imbalanced data therefore produces biased output and poorly calibrated, highly uncertain probability estimates.

Getting well-calibrated probabilities from a machine learning model remains an open problem under class imbalance. Several methods have been proposed, but none of them works on all data, so researchers are still exploring different theoretical and empirical approaches.

Calibration is a post-processing technique that adjusts a model's predicted probabilities so that they match the frequencies actually observed in the data. If the probabilities are not calibrated, they cannot be trusted as a measure of confidence.
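
Stated a little more formally (the notation below is introduced here for illustration only), a binary classifier that outputs a predicted probability \hat{p}(x) for the positive class is perfectly calibrated when:

```latex
\[
  \Pr\bigl(Y = 1 \mid \hat{p}(X) = p\bigr) = p
  \quad \text{for all } p \in [0, 1]
\]
```

For example, among all samples that receive a predicted probability of 0.8, about 80% should actually belong to the positive class.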

In this post, I applied the XGBoost classification algorithm, with and without calibration, to imbalanced synthetic data generated with the make_classification() function of sklearn. The synthetic data contains 20,000 samples, 95% in class 0 and 5% in class 1. I used isotonic calibration to obtain calibrated probabilities from the XGBoost model.
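
To give a feel for what isotonic calibration does, here is a minimal sketch of the pool-adjacent-violators idea behind isotonic regression: it fits a non-decreasing mapping from raw scores to observed label frequencies. The function names (pav_fit, pav_predict) and the toy scores are made up for illustration, the sketch assumes distinct scores, and it returns a simple step function. It is not the code path used by CalibratedClassifierCV, whose implementation differs in detail.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function to (score, label) pairs.

    Hypothetical helper: a bare-bones pool-adjacent-violators pass,
    assuming all scores are distinct.
    Returns (uppers, values): the upper score bound and the fitted
    probability of each step.
    """
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds this one's,
        # which restores the non-decreasing order of the block means.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def pav_predict(uppers, values, score):
    """Map a raw score to the fitted probability of the step it falls in."""
    for upper, value in zip(uppers, values):
        if score <= upper:
            return value
    return values[-1]  # scores beyond the last step reuse the last value


# Toy data (made up): raw scores whose labels are not perfectly ordered
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

uppers, values = pav_fit(scores, labels)
print(values)
print(pav_predict(uppers, values, 0.45))
```

The positive at score 0.30 is followed by two negatives, so the three samples are pooled into one step whose value is their shared fraction of positives. Because the mapping is only constrained to be non-decreasing, isotonic calibration can correct distortions of any monotonic shape, but it needs enough samples to avoid overfitting.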

I created calibration plots using the predicted probabilities from both the uncalibrated and the calibrated XGBoost models. A calibration plot shows, for each bin of predicted probabilities, the actual frequency of the positive label against the predicted probability. The x-axis is the average predicted probability in each bin. The y-axis is the fraction of positives, i.e., the proportion of samples in the bin that belong to the positive class. If the model returns well-calibrated probabilities, the calibration curve follows the y=x line. A curve below the y=x line means the average predicted probability in a bin is greater than the fraction of positives, so the model overestimates the positive class. A curve above the y=x line means the average predicted probability is less than the fraction of positives, so the model underestimates it. On this imbalanced data, the calibration curve of the uncalibrated model does not follow the y=x line.
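
The binning behind such a plot can be sketched in a few lines of NumPy. The helper name reliability_bins and the toy arrays are hypothetical, and the sketch assumes equal-width bins over [0, 1]. In practice, sklearn.calibration.calibration_curve computes the same kind of per-bin summary.

```python
import numpy as np


def reliability_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted prob, fraction of positives, count) per bin.

    Hypothetical helper: equal-width bins over [0, 1]; empty bins are skipped.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps each probability to 0 .. n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append(
                (float(y_prob[mask].mean()),   # x-axis value for this bin
                 float(y_true[mask].mean()),   # y-axis value for this bin
                 int(mask.sum()))
            )
    return rows


# Toy data (made up): four low-probability and four high-probability samples
y_prob = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
y_true = [0, 0, 0, 1, 1, 1, 0, 0]

rows = reliability_bins(y_true, y_prob)
for mean_pred, frac_pos, count in rows:
    print(f"mean predicted={mean_pred:.2f}  fraction of positives={frac_pos:.2f}  n={count}")
```

In this toy case the low bin lies above the y=x line (the model underestimates the positive class there) and the high bin lies below it (the model overestimates).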

Here is the complete code to generate calibration curves using uncalibrated and calibrated probabilities:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from collections import Counter
import xgboost as xgb
from sklearn.model_selection import cross_val_predict
from sklearn.calibration import CalibrationDisplay
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold


def generate_synthetic_data(n_samples=1000, n_features=25, n_classes=2, n_clusters_per_class=1, weights=None,
                            flip_y=0, class_sep=0.3):
    """
    Generate synthetic data using sklearn make_classification
    """
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, weights=weights,
                               n_clusters_per_class=n_clusters_per_class, flip_y=flip_y, class_sep=class_sep,
                               random_state=7)
    return X, y


def calibration_plot_without_calibrated_model(X, y):
    """
    generate calibration plot using the predicted probabilities from an uncalibrated model
    """
    model = xgb.XGBClassifier()
    probs = cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
    CalibrationDisplay.from_predictions(y, probs)
    plt.show()


def calibration_plot_with_calibrated_model(X, y):
    """
    generate calibration plot using the predicted probabilities from a calibrated model
    """
    # with calibrated model
    y_probs = []
    y_test = []
    model = xgb.XGBClassifier()
    skf = StratifiedKFold(n_splits=5, random_state=7, shuffle=True)
    # split the function arguments, not the module-level globals
    for tr_idx, te_idx in skf.split(X, y):
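        # the estimator is passed positionally because the keyword was renamed
        # from base_estimator to estimator in scikit-learn 1.2
        calibrated_model = CalibratedClassifierCV(model, cv=5, method='isotonic', n_jobs=4)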
        calibrated_model.fit(X[tr_idx], y[tr_idx])
        y_test.extend(y[te_idx])
        y_probs.extend(calibrated_model.predict_proba(X[te_idx])[:, 1])

    CalibrationDisplay.from_predictions(y_test, y_probs)
    plt.show()


if __name__ == "__main__":
    # generate synthetic data
    data, labels = generate_synthetic_data(n_samples=20000, n_features=25, n_classes=2, weights=[0.95, 0.05])
    print(Counter(labels))

    # generate calibration plot using uncalibrated probabilities
    calibration_plot_without_calibrated_model(data, labels)

    # generate calibration plot using calibrated probabilities
    calibration_plot_with_calibrated_model(data, labels)

The calibration curve generated using uncalibrated probabilities is as follows:

Calibration curve using uncalibrated probabilities

The calibration curve generated using calibrated probabilities is as follows:

Calibration curve using calibrated probabilities
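
The two curves can also be complemented with a single number. One common choice is the Brier score, the mean squared difference between the predicted probability and the 0/1 label, where lower is better. The sketch below uses a hypothetical helper, brier_score, and made-up toy predictions. It is not output from the experiment above, and sklearn.metrics.brier_score_loss provides the same metric.

```python
import numpy as np


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


# Toy data (made up): 10% positives
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
matched = [0.1] * 10    # predictions that match the positive rate
inflated = [0.5] * 10   # predictions that overstate the positive class

print("matched :", brier_score(y_true, matched))
print("inflated:", brier_score(y_true, inflated))
```

Applied to the arrays collected in the two plotting functions (probs and y_probs, with their matching labels), a helper like this would give a numeric comparison of the uncalibrated and calibrated models. Note that the Brier score reflects both calibration and how well the model separates the classes, so it is best read alongside the plots.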
