Probabilistic Predictions in Classification - Evaluating Quality

by ALEKSEI TERENTEV, January 4th, 2025 (16 min read)
Too Long; Didn't Read

Binary classification is one of the most common machine learning tasks. In practice, the goal of such tasks often extends beyond simply predicting a class. Traditional metrics such as accuracy, recall or F-measure are not suitable for such tasks. Specialized tools are needed to assess the quality of probability predictions.


Binary classification is one of the most common machine learning tasks, encountered in numerous practical applications.


However, in practice, the goal of such tasks often extends beyond simply predicting a class. What becomes much more important is the model's ability to estimate the probability of an object belonging to one class or another. In other words, we are interested not only in which class to choose but also in how confident the model is in its decision.


Such tasks are quite frequent. For example, in credit scoring, there is a task of estimating the probability of client default — predicting whether a client will stop paying their loan. Banks use such models to make decisions based on the calculated default probabilities: whether to issue a loan and, if so, under what terms. In this context, precise probability estimation emerges as a pivotal factor shaping financial outcomes.


But how can we determine the accuracy of the model's predictions? Traditional metrics such as accuracy, recall, or F-measure are not suitable for such tasks, because they score the hard class labels obtained at a fixed threshold rather than the probabilities themselves. Specialized tools are needed to assess the quality of probability predictions.


In this article, I will share practical experience in evaluating probabilistic predictions, discuss the key metrics used in practice, and explain how to interpret them and what purposes they are best suited for.


The Binary Classification Problem in Formal Terms

Let us consider a dataset with l observations:

$$\{(x_i, y_i)\}_{i=1}^{l}, \quad y_i \in \{0, 1\}$$

And let’s assume we have trained a binary classification model that predicts p_i, the probability that y_i = 1 for the object x_i.


Evaluating the Quality of Probability Predictions

Let’s try to evaluate the quality of probability predictions for such a classifier. What properties should an ideally predicted probability possess?


First, probabilities should effectively rank objects by their likelihood of belonging to a specific class. This means that an object with characteristics of class "1" should have a higher probability of belonging to this class than an object that lacks those characteristics.


Second, probabilities should be calibrated, meaning they should align with the true frequency of events. Calibration implies that the model’s predictions reflect the actual likelihood of an event. For instance, if the model predicts a probability of 0.8 for a group of objects, then 80% of those objects should indeed belong to the positive class. A calibrated model, therefore, not only ranks objects effectively but also provides meaningful and interpretable probability predictions.

Log Loss (Logarithmic Loss)

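Log Loss evaluates how much the predicted probabilities p_i deviate from the true labels y_i. The metric is calculated as follows:

$$\text{LogLoss} = -\frac{1}{l} \sum_{i=1}^{l} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
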
The lower the Log Loss, the better the model predicts probabilities: it ranges from 0 (perfect predictions) to infinity (confident but incorrect predictions). The metric reaches its minimum when, for unambiguous objects, the model predicts probabilities close to 1 for the correct class and close to 0 for the others. For objects with characteristics of both classes, the probabilities should reflect their uncertainty, such as being closer to 0.5.
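
To make this behavior concrete, here is a minimal sketch that computes the standard binary Log Loss by hand with NumPy. The helper name `manual_log_loss`, the clipping constant `eps`, and the three sets of predictions are illustrative assumptions, not part of any library.

```python
import numpy as np

def manual_log_loss(y_true, p_pred, eps=1e-15):
    """Binary Log Loss: -mean(y * log(p) + (1 - y) * log(1 - p))."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that predictions of exactly 0 or 1 do not produce log(0)
    # (an assumed safeguard for this sketch)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [0, 1, 1, 0, 1]

# Hypothetical predictions of three different models
confident_right = manual_log_loss(y_true, [0.1, 0.9, 0.8, 0.3, 0.6])
always_half = manual_log_loss(y_true, [0.5, 0.5, 0.5, 0.5, 0.5])
confident_wrong = manual_log_loss(y_true, [0.9, 0.1, 0.2, 0.7, 0.4])

print(f"Mostly correct and confident: {confident_right:.3f}")
print(f"Always 0.5:                   {always_half:.3f}")
print(f"Confident but wrong:          {confident_wrong:.3f}")
```

The always-0.5 model scores log(2), and the value grows quickly once the model is confident in the wrong direction.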


Log Loss correlates well with other probability evaluation metrics but is sensitive to outliers and can be challenging to interpret. For instance, a Log Loss value of 0.8 cannot always be definitively classified as "good" or "bad."
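
The sensitivity to outliers can be illustrated with a small sketch. The data below are hypothetical: 100 positive objects predicted with probability 0.9, and then the same set with a single confidently wrong prediction. The helper `manual_log_loss` is an illustrative name.

```python
import numpy as np

def manual_log_loss(y_true, p_pred):
    """Binary Log Loss without clipping (inputs here stay inside (0, 1))."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# 100 objects of class 1, all predicted with probability 0.9
y_true = np.ones(100)
p_clean = np.full(100, 0.9)

# Same predictions, except one object gets a confident wrong probability
p_outlier = p_clean.copy()
p_outlier[0] = 0.001

loss_clean = manual_log_loss(y_true, p_clean)
loss_outlier = manual_log_loss(y_true, p_outlier)

print(f"Log Loss without the outlier: {loss_clean:.4f}")
print(f"Log Loss with one outlier:    {loss_outlier:.4f}")
```

A single object out of 100 raises the average noticeably, which is why a few extreme errors can dominate the metric.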


Application: Log Loss is suitable for comparing models (the one with the lower value is preferred) but is less helpful for assessing the absolute quality of predictions made by a single model.


from sklearn.metrics import log_loss
import numpy as np

# Example of true class labels (0 or 1)
y_true = [0, 1, 1, 0, 1]
# Example of predicted probabilities of belonging to class 1
y_pred_proba = [0.1, 0.9, 0.8, 0.3, 0.6]

# Calculation of Log Loss
logloss = log_loss(y_true, y_pred_proba)

print(f"Log Loss: {logloss}")


ROC Curve and ROC-AUC

One of the most popular metrics is ROC-AUC.


The ROC Curve (Receiver Operating Characteristic curve) is a graph where the X-axis corresponds to the False Positive Rate (FPR), and the Y-axis corresponds to the True Positive Rate (TPR).

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$


  • True Positives (TP): The number of objects correctly classified as class "1."
  • False Positives (FP): The number of objects incorrectly classified as class "1."
  • False Negatives (FN): The number of objects incorrectly classified as class "0."
  • True Negatives (TN): The number of objects correctly classified as class "0."



How to Build an ROC Curve?

  1. Sort the objects: Arrange the objects in descending order of the probability for class "1" predicted by the model.
  2. Start from zero: Assume that all objects belong to class "0" and calculate the TPR (True Positive Rate) and FPR (False Positive Rate).
  3. Add objects one by one: Take the first object (with the highest probability for class "1") and reclassify it as class "1." Recalculate the TPR and FPR. Repeat this for the next object, and so on, until all objects are classified as class "1."
  4. Plot the curve: Each time an object is added, record the TPR and FPR values as a pair and mark them on the graph.
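
The steps above can be sketched directly in NumPy. This is a minimal illustration that assumes there are no tied scores (ties would have to be added as a group); the helper names `roc_points` and `trapezoid_area` are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """FPR and TPR after reclassifying the top-k objects as class 1, k = 0..l."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s, kind="stable")      # step 1: sort by descending score
    y_sorted = y[order]
    n_pos = y_sorted.sum()
    n_neg = len(y_sorted) - n_pos
    # Step 2 is the leading zero (everything is class 0);
    # step 3 is the cumulative count as objects are added one by one
    tp = np.concatenate([[0], np.cumsum(y_sorted)])
    fp = np.concatenate([[0], np.cumsum(1 - y_sorted)])
    return fp / n_neg, tp / n_pos

def trapezoid_area(x, y):
    """Area under a piecewise-linear curve given by points (x, y)."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))

y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5]

fpr, tpr = roc_points(y_true, scores)
auc_value = trapezoid_area(fpr, tpr)

for x, y in zip(fpr, tpr):
    print(f"FPR = {x:.1f}, TPR = {y:.1f}")
print(f"Area under the curve: {auc_value:.2f}")
```

Each positive object moves the curve one step up and each negative object moves it one step to the right, so a model that places the positives first climbs to the top-left corner quickly.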


The ROC curve always passes through the points (0,0) and (1,1). It starts at (0,0) when all objects are classified as class "0" and reaches (1,1) when all objects are classified as class "1."


If the curve is closer to the top-left corner, it indicates that the model performs well, correctly separating the classes with minimal errors. If the model's predictions are random, the curve will follow the diagonal from (0,0) to (1,1). Conversely, if the model frequently confuses the classes (e.g., labeling class "0" as "1" and vice versa), the curve will approach the bottom-right corner. To derive a numerical metric from this, the area under the ROC curve is calculated.


ROC-AUC (Area Under the ROC Curve) is the area under the ROC curve.

The larger this area, the better the model. ROC-AUC can take values between 0 and 1.


  • ROC-AUC = 1: Perfect model.
  • ROC-AUC = 0.5: Random predictions — the model found no patterns and is simply guessing.
  • ROC-AUC = 0: The model predicts classes in reverse.


ROC-AUC evaluates the model's ability to correctly rank objects based on their likelihood of belonging to a class, but it does not assess the calibration of probabilities — i.e., how well the predicted probabilities match the true event frequencies. For example, multiplying all probabilities by 1000 will result in values that are no longer valid probabilities, but the ROC-AUC will remain unchanged because scaling probabilities by a constant does not alter the ranking of objects.
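
A short sketch can confirm this. It relies on the well-known equivalent view of ROC-AUC as the share of (class "1", class "0") pairs in which the class "1" object receives the higher score; the helper `pairwise_auc` is an illustrative name, not a library function.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos = s[y == 1]
    neg = s[y == 0]
    diff = pos[:, None] - neg[None, :]         # every positive vs every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5])

auc_original = pairwise_auc(y_true, probs)
auc_scaled = pairwise_auc(y_true, probs * 1000)   # no longer valid probabilities
auc_squared = pairwise_auc(y_true, probs ** 2)    # another order-preserving transform

print(auc_original, auc_scaled, auc_squared)
```

All three values coincide because only the order of the scores matters, which is exactly why ROC-AUC says nothing about calibration.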


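In cases of severe class imbalance (e.g., when class "1" constitutes only a small percentage of the dataset), ROC-AUC may overestimate the model's quality: the FPR denominator (FP + TN) is dominated by the many class "0" objects, so even a large absolute number of false positives barely changes the FPR and has little impact on the final score.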
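
A worked example with made-up confusion-matrix counts shows the effect. All numbers below are hypothetical and chosen only to illustrate a dataset with 100 objects of class "1" and 100,000 objects of class "0".

```python
# Hypothetical counts at some fixed threshold
tp, fn = 80, 20          # 100 actual positives
fp, tn = 400, 99_600     # 100,000 actual negatives

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
precision = tp / (tp + fp)

print(f"TPR:       {tpr:.3f}")
print(f"FPR:       {fpr:.3f}")
print(f"Precision: {precision:.3f}")
```

The point (FPR, TPR) sits very close to the top-left corner of the ROC plot, yet five out of six objects flagged as class "1" are false alarms. The ROC view hides this, while Precision exposes it.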


Application: ROC-AUC is used to evaluate how well a model ranks objects by their class membership probability. It is not suitable for assessing the calibration quality of predicted probabilities.


import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Example of true class labels
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
# Example of predicted probabilities of belonging to class 1
y_pred_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5]

# Plotting the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)

# Calculation of AUC
roc_auc = roc_auc_score(y_true, y_pred_proba)

# Plotting the graph
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()


PR-Curve and PR-AUC

PR-Curve (Precision-Recall Curve) is a graph that shows the relationship between Precision and Recall for different classification thresholds.


  • Precision is the proportion of true positive examples among all examples classified as positive:

    $$\text{Precision} = \frac{TP}{TP + FP}$$


  • Recall is the proportion of true positive examples among all actual positive examples:

    $$\text{Recall} = \frac{TP}{TP + FN}$$

  • True Positives (TP): The number of objects correctly classified as class "1."
  • False Positives (FP): The number of objects incorrectly classified as class "1."
  • False Negatives (FN): The number of objects incorrectly classified as class "0."



How to Build a PR-Curve?

  1. Sort the objects: Arrange the objects in descending order based on the probability of class "1" predicted by the model.
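  2. Start from zero: Assume all objects belong to class "0." Recall is 0 at this point, while Precision is undefined (0/0) because nothing has been classified as class "1"; by convention (for example, in scikit-learn) it is set to 1.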
  3. Add objects one by one: Take the first object (with the highest predicted probability for class "1") and classify it as class "1." Recalculate Precision and Recall. Repeat this process for the next object, and so on, until all objects are classified as class "1."
  4. Plot the curve: Each time an object is added, record the Precision and Recall values as a pair and plot them on the graph.
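
The procedure above can be sketched in a few lines of NumPy. The sketch assumes there are no tied scores and starts from the first added object, since Precision is undefined while nothing is classified as class "1"; the helper name `pr_points` is hypothetical.

```python
import numpy as np

def pr_points(y_true, scores):
    """Precision and Recall after labeling the top-k objects as class 1, k = 1..l."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s, kind="stable")   # sort by descending score
    y_sorted = y[order]
    tp = np.cumsum(y_sorted)                # true positives among the top-k
    k = np.arange(1, len(y_sorted) + 1)     # objects labeled as class 1 so far
    precision = tp / k
    recall = tp / y_sorted.sum()
    return precision, recall

y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5]

precision, recall = pr_points(y_true, scores)
for p, r in zip(precision, recall):
    print(f"Recall = {r:.1f}, Precision = {p:.3f}")
```

Note that the last point always has Recall 1 and a Precision equal to the share of class "1" in the data, because by then every object is labeled "1."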

Interpretation:

  • If the curve is closer to the top-right corner, the model performs well, accurately identifying positive classes with minimal errors.
  • If the model frequently confuses classes (e.g., often labeling class "0" as "1" and vice versa), the curve will be closer to the bottom-left corner.
  • For random predictions, the curve is roughly a horizontal line at a Precision equal to the share of class "1" in the dataset, so the baseline depends on the class balance.


To derive a numerical metric from the curve, the area under the PR curve (Precision-Recall AUC) is calculated.


PR-AUC (Area Under the PR-Curve) is the area under the PR curve.


PR-AUC ranges from 0 to 1. The higher the area, the better the model.


  • PR-AUC = 1: Perfect model.
  • A low PR-AUC indicates a weak ability of the model to identify class "1."


PR-AUC shares similarities with ROC-AUC as both measure the quality of ranking objects by class membership. However, PR-Curve focuses on a single class, ignoring the other. PR-Curve is more suitable in cases where:


  • One class is significantly rarer than the other. PR-AUC highlights the classifier's performance on the rare class more effectively.
  • Correctly predicting one class is more important than balancing performance across both classes.


Application: PR-AUC is used to evaluate the quality of ranking objects by their likelihood of belonging to a class, particularly when one class is much rarer than the other or when focus needs to be on a single class. It is not used to assess the calibration of predicted probabilities.


import numpy as np
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, auc
import matplotlib.pyplot as plt

# Example of data
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])  # True labels (0 or 1)
y_scores = np.array([0.1, 0.9, 0.8, 0.3, 0.6, 0.4, 0.7, 0.2, 0.1, 0.85])  # Predicted probabilities for the positive class

# 1. Calculation of Precision and Recall for a fixed threshold
threshold = 0.5  # Example of a threshold
y_pred = (y_scores >= threshold).astype(int)  # Predictions based on the threshold

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Precision (at threshold {threshold}): {precision:.2f}")
print(f"Recall (at threshold {threshold}): {recall:.2f}")

# 2. Computing the points of the PR curve
precision_vals, recall_vals, thresholds = precision_recall_curve(y_true, y_scores)

# Calculation of PR-AUC
pr_auc = auc(recall_vals, precision_vals)

print(f"PR-AUC: {pr_auc:.2f}")

# 3. Plotting the PR curve
plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, label=f'PR Curve (AUC = {pr_auc:.2f})', linewidth=2)
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve', fontsize=14)
plt.legend(loc='best')
plt.grid(True)
plt.show()


Calibration Curve and Expected Calibration Error (ECE)

Reliability Diagram (Calibration Curve) and Expected Calibration Error (ECE) are tools for assessing the calibration of probabilistic models. They are used to analyze how well a model predicts outcome probabilities.


A Calibration Curve is a plot that shows the relationship between predicted probabilities and the actual frequency of successes.

How to Build a Calibration Curve?

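  1. Sort the objects: Arrange the objects in ascending order of the probability of class "1" predicted by the model.
  2. Divide into bins: Split the range of predicted probabilities [0, 1] into intervals of equal width and assign each object to the bin that contains its predicted probability.
  3. Calculate the average probability and observed frequency: For each bin g, compute the following:

    • Average predicted probability:

      $$P_g = \frac{1}{|g|} \sum_{i \in g} p_i$$

      where |g| is the number of objects in bin g.

    • The proportion of true positive outcomes (empirical probability):

      $$E_g = \frac{p_g}{p_g + n_g}$$

      where p_g is the number of objects in class "1" in bin g, and n_g is the number of objects in class "0" in bin g.

  4. Plot the curve: Mark the points (P_g, E_g) on the graph, with P_g on the X-axis and E_g on the Y-axis.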
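
As a rough sketch of this construction, the code below computes P_g and E_g by hand. It assumes equal-width bins over [0, 1], and both the helper name `calibration_points` and the toy data are hypothetical.

```python
import numpy as np

def calibration_points(y_true, p_pred, n_bins=5):
    """Return (P_g, E_g, bin sizes) for the non-empty equal-width bins."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Only interior edges are used, so bin indices run from 0 to n_bins - 1
    bin_ids = np.digitize(p, edges[1:-1], right=True)
    P, E, sizes = [], [], []
    for g in range(n_bins):
        mask = bin_ids == g
        if mask.any():                    # skip empty bins
            P.append(p[mask].mean())      # average predicted probability
            E.append(y[mask].mean())      # share of class "1" in the bin
            sizes.append(int(mask.sum()))
    return np.array(P), np.array(E), np.array(sizes)

# Toy data: three groups of objects with identical predictions inside each group
p_pred = np.array([0.25] * 12 + [0.70] * 10 + [0.90] * 10)
y_true = np.array([1] * 3 + [0] * 9 + [1] * 7 + [0] * 3 + [1] * 6 + [0] * 4)

P, E, sizes = calibration_points(y_true, p_pred)
for p_g, e_g, n in zip(P, E, sizes):
    print(f"P_g = {p_g:.2f}, E_g = {e_g:.2f}, objects in bin = {n}")
```

In this toy data the first two groups lie exactly on the diagonal, while the third group has P_g = 0.90 but E_g = 0.60, so its point falls below the diagonal.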


In an ideal model, all points on the Calibration Curve lie on the diagonal y = x. This means that if the model predicts a probability of 0.7, the event actually occurs in 70% of cases. If sections of the curve are above the diagonal (as shown in the example graph), it indicates that the model underestimates the actual probability of events. Conversely, if sections of the curve are below the diagonal, the model overestimates probabilities, meaning it is overly confident in its predictions.


The Calibration Curve visualizes how well the model predicts probabilities across different ranges. It helps identify which groups of objects the model struggles with the most and in which direction the errors occur. For example, in the example graph, the model significantly underestimates probabilities for objects it most confidently classifies as class "1." This suggests that while it assigns these objects to class "1," it is not sufficiently confident in its predictions.


The Calibration Curve does not provide insight into how well the model ranks objects by class membership. Therefore, it should be used in conjunction with an ROC or PR curve.


Another important limitation of the Calibration Curve lies in its construction. Since the algorithm involves splitting objects into bins based on predicted probabilities, there may be cases where individual points on the curve are based on a small number of objects. This makes the predicted and empirical probability estimates for those points noisy, so these sections of the curve are less reliable.


From the points on the Calibration Curve, the Expected Calibration Error (ECE) metric can be calculated.


Expected Calibration Error (ECE) measures how much the model's predictions deviate from actual probabilities. It is a scalar value that aggregates the difference between the predicted probability and the observed frequency of successes:

$$\text{ECE} = \sum_{i=1}^{B} \frac{B_i}{l} \left| E_i - P_i \right|$$

where B is the number of bins, B_i is the number of predictions in bin i, E_i and P_i are the empirical and average predicted probabilities in bin i, and l is the total number of observations.
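
A small worked example may help. It assumes ECE is the size-weighted average of the absolute gap between the empirical and the average predicted probability in each bin; the three bin summaries below are hypothetical.

```python
import numpy as np

# Hypothetical per-bin summaries
bin_sizes = np.array([12, 10, 10])        # B_i: number of predictions per bin
P = np.array([0.25, 0.70, 0.90])          # average predicted probability per bin
E = np.array([0.25, 0.70, 0.60])          # observed share of class "1" per bin

l = bin_sizes.sum()                       # total number of observations
ece = float(np.sum(bin_sizes / l * np.abs(E - P)))

print(f"ECE: {ece:.5f}")
```

Only the third bin is miscalibrated here, so the result is its gap of 0.30 weighted by its share of the data, 10 / 32.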


The lower the ECE, the better the model's calibration. The metric can range from "0" (for a perfect model) to "1" (for a poorly calibrated model).


Application: The Calibration Curve is used to evaluate the calibration of probabilistic models. It helps analyze how accurately the model predicts outcome probabilities. Since the metric is less sensitive to the quality of object ranking, it should be used alongside the ROC curve or PR curve.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Example of data (replace with your data)
y_true = np.random.randint(0, 2, size=1000)  # True labels (0 or 1)
y_pred = np.random.rand(1000)  # Predicted probabilities

# Number of bins for grouping
n_bins = 10

# Plotting the Reliability Diagram
prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=n_bins, strategy='uniform')

# Reliability Diagram plot
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', label='Calibration Curve')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfect Calibration')
plt.title('Reliability Diagram (Calibration Curve)')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.legend()
plt.grid()
plt.show()

# Calculation of Expected Calibration Error (ECE)
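def compute_ece(y_true, y_pred, n_bins=10):
    """Calculates Expected Calibration Error (ECE) with equal-width bins"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bins = np.linspace(0, 1, n_bins + 1)
    # Use only the interior edges so that bin indices run from 0 to n_bins - 1
    # and predictions equal to exactly 0.0 are not silently dropped
    bin_indices = np.digitize(y_pred, bins[1:-1], right=True)
    ece = 0.0

    for i in range(n_bins):
        bin_mask = bin_indices == i
        bin_size = bin_mask.sum()
        if bin_size > 0:
            bin_confidence = y_pred[bin_mask].mean()  # average predicted probability
            bin_accuracy = y_true[bin_mask].mean()    # observed share of class 1
            ece += (bin_size / len(y_true)) * abs(bin_accuracy - bin_confidence)
    return ece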

# Calculation of ECE (Expected Calibration Error)
ece = compute_ece(y_true, y_pred, n_bins=n_bins)
print(f"Expected Calibration Error (ECE): {ece:.4f}")


Hosmer-Lemeshow Curves and Statistics

Hosmer-Lemeshow Curves and Statistics are tools for assessing the calibration of probabilistic models and visualizing predicted probabilities.


These tools are not commonly found in articles or literature and may have different names in various sources. In practice, the curves are often referred to as a Gain Chart, and the statistic is known as the Hosmer-Lemeshow Test. However, this tool has proven effective in practice and is arguably one of the most informative methods for visualizing the quality of predicted probabilities.


How to Build Hosmer-Lemeshow Curves?

  1. Sort the Objects:

    Arrange the objects in ascending order of the probability of class "1" predicted by the model.


  2. Divide into Bins:

    Split the dataset into 10 bins of equal size.


  3. Calculate Average Probability and Frequency

    For each bin, calculate the following:

    • The average predicted probability:

      $$P_g = \frac{1}{|g|} \sum_{i \in g} p_i$$

      where |g| is the number of objects in bin g.

    • The proportion of true positive outcomes (empirical probability):

      $$E_g = \frac{p_g}{p_g + n_g}$$

      where p_g is the number of objects in class "1", and n_g is the number of objects in class "0" in bin g.


  4. Plotting the curves:
    Two graphs are created — one showing the dependence of P_g on the bin number and the other showing the dependence of E_g on the bin number.




Since all bins are of equal size, every E_g and P_g is estimated from the same number of objects, so no point on the curves rests on just a handful of observations.
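
The difference between the two binning schemes can be seen in a small sketch. The scores below are hypothetical and deliberately skewed toward 0; the sketch assumes equal-width bins for a calibration curve and equal-size bins for Hosmer-Lemeshow curves.

```python
import numpy as np

# Hypothetical, strongly skewed scores: most predictions are close to 0
scores = np.sort(np.linspace(0.0, 1.0, 100) ** 3)

# Equal-width bins over [0, 1]
width_counts, _ = np.histogram(scores, bins=10, range=(0.0, 1.0))

# Equal-size bins: split the sorted scores into 10 consecutive chunks
size_counts = np.array([len(chunk) for chunk in np.array_split(scores, 10)])

print("Objects per equal-width bin:", width_counts.tolist())
print("Objects per equal-size bin: ", size_counts.tolist())
```

With equal-width bins some bins hold only a few objects, while the equal-size split gives every bin exactly 10.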


The analysis is based on how well the curves of the average predicted probability and the proportion of observed events align with each other. Here's how to determine if the model is good:


  1. The E_g curve (proportion of observed events) increases. This indicates that the predicted probabilities effectively rank objects by their likelihood of belonging to the class.
  2. The E_g curve is concave downward, and its values for the first and last bins are "close" to 0 and 1, respectively. This means the model confidently separates the classes.
  3. The points E_g and P_g (average predicted probability) are close to each other for corresponding bins. This indicates that the probabilities are well-calibrated.


These graphs help evaluate how well the model predicts probabilities and identify potential issues.


For instance, in the example above, the curve for a good model is shown: it is concave downward, and the bins are sorted in ascending order. In the first bin, the empirical probability E_g is close to 0, and in the last bin, it is close to 1, confirming the model's confidence in class separation. However, there are some noticeable issues:


  • From the first to the seventh bin, there is an underestimation of predicted probabilities, indicating calibration issues for objects with lower class "1" probabilities.

  • In the fourth bin, the proportion of class "1" objects E_g is lower than in the previous bins.

    This behavior may indicate data anomalies, labeling errors, or feature distribution peculiarities. These bins should be examined separately to understand the cause.


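The Hosmer-Lemeshow statistic is built based on these graphs and is calculated as follows:

$$H = \sum_{i=1}^{B} \frac{B_i \left( E_i - P_i \right)^2}{P_i \left( 1 - P_i \right)}$$

where B is the number of bins, B_i is the number of objects in bin i, and E_i and P_i are the empirical and average predicted probabilities in bin i.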


The Hosmer-Lemeshow statistic can be compared with the critical value of the chi-squared distribution with B-2 degrees of freedom. In practice, however, it is rarely used as a formal statistical test and more often serves as a metric for comparing two models.
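
Below is a minimal sketch of the computation, assuming the standard form of the Hosmer-Lemeshow statistic (squared gap between observed and expected positives per bin, divided by the binomial variance) and equal-size bins. The function name `hosmer_lemeshow` and the synthetic data are hypothetical, and the sketch assumes no bin has an average prediction of exactly 0 or 1.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, p_pred, n_bins=10):
    """Return the Hosmer-Lemeshow statistic and its chi-squared p-value."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    order = np.argsort(p, kind="stable")        # sort by predicted probability
    y_bins = np.array_split(y[order], n_bins)   # bins of (almost) equal size
    p_bins = np.array_split(p[order], n_bins)
    stat = 0.0
    for y_b, p_b in zip(y_bins, p_bins):
        n = len(y_b)
        observed = y_b.sum()                    # actual number of class 1 objects
        expected = p_b.sum()                    # number implied by the predictions
        mean_p = expected / n
        stat += (observed - expected) ** 2 / (n * mean_p * (1 - mean_p))
    p_value = chi2.sf(stat, df=n_bins - 2)      # B - 2 degrees of freedom
    return float(stat), float(p_value)

# Synthetic data: labels are drawn from the "true" probabilities
rng = np.random.default_rng(0)
true_probs = rng.uniform(0.05, 0.95, size=5000)
y_true = (rng.random(5000) < true_probs).astype(int)

stat_calibrated, pval_calibrated = hosmer_lemeshow(y_true, true_probs)
stat_distorted, pval_distorted = hosmer_lemeshow(y_true, true_probs ** 2)

print(f"Calibrated predictions: H = {stat_calibrated:.1f}")
print(f"Distorted predictions:  H = {stat_distorted:.1f}")
```

Squaring the probabilities keeps the ranking intact but breaks calibration, so the second statistic comes out much larger than the first. This is the comparison-between-models use of the statistic.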


Application: Hosmer-Lemeshow curves are an excellent tool for visualizing the quality of probability predictions. They help evaluate how the classifier predicts probabilities and identify problematic areas.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.isotonic import IsotonicRegression

# Generation of data with correlation between predictions and true values
np.random.seed(42)
base_probs = np.random.rand(1000)
y_true = np.random.choice([0, 1], size=1000, p=[0.7, 0.3])
correlated_probs = np.where(y_true == 1, base_probs + 0.3, base_probs - 0.3)
pred_probs = np.clip(correlated_probs, 0, 1)

# Creating a DataFrame
df = pd.DataFrame({
    'y_true': y_true,
    'pred_probs': pred_probs
})

# Applying isotonic regression for calibration (fit and applied to the same data, for illustration only)
iso_reg = IsotonicRegression(out_of_bounds='clip')
calibrated_probs = iso_reg.fit_transform(df['pred_probs'], df['y_true'])

# Updating data with calibrated predictions
df['pred_probs_calibrated'] = calibrated_probs

# Reordering data by calibrated probability
df_sorted_calibrated = df.sort_values(by='pred_probs_calibrated').reset_index(drop=True)

# Splitting into equal-sized bins (deciles)
df_sorted_calibrated['bin'] = pd.qcut(df_sorted_calibrated.index, q=10, labels=False)

# Counting statistics for each bin
bin_stats_calibrated = df_sorted_calibrated.groupby('bin').agg(
    mean_predicted_prob=('pred_probs_calibrated', 'mean'),  # Average calibrated probability
    count_class_1=('y_true', 'sum'),                       # Number of objects in class "1"
    total_count=('y_true', 'count')                        # Total number of objects in the bin
).reset_index()

# Adding the proportion of true positive results
bin_stats_calibrated['empirical_prob'] = bin_stats_calibrated['count_class_1'] / bin_stats_calibrated['total_count']

# Plotting the graph for calibrated probabilities
plt.figure(figsize=(12, 7))

# Histogram of empirical probability
plt.bar(bin_stats_calibrated['bin'], bin_stats_calibrated['empirical_prob'], alpha=0.6, label="Empirical probability", color='blue')

# Curve of average calibrated probability.
plt.plot(bin_stats_calibrated['bin'], bin_stats_calibrated['mean_predicted_prob'], marker='o', label="Average predicted probability (calibration)", color='orange')

# Setting up the axes and legend.
plt.xlabel('Bin number')
plt.ylabel('Probability')
plt.title('Average calibrated probability and empirical probability by bins')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()

# Printing the per-bin statistics
print(bin_stats_calibrated)


Conclusions

Depending on the goals of your analysis, different metrics can be used. However, for a comprehensive understanding of how the model predicts probabilities, it’s best to consider all of them. Here’s how they can be applied:


  • For evaluating class separation quality, i.e., how well the model ranks objects by class membership, metrics such as ROC-AUC and PR-AUC are suitable (the latter is especially relevant for tasks with imbalanced classes).
  • For visualizing and checking probability calibration, tools like Hosmer-Lemeshow curves and Calibration Curves are helpful. These provide a clear representation of how well the predicted probabilities align with actual outcomes.
  • For comparing models, metrics such as Log Loss, ROC-AUC, and ECE (Expected Calibration Error) are often used, with ECE specifically accounting for calibration errors.


This approach allows for a holistic evaluation of the model, identifying its strengths and weaknesses, and making informed decisions about the quality of your model.