Chapter 7 Model Evaluation

7.1 Classification Models: Evaluation

7.1.1 Thresholding

Logistic regression returns a probability. You can use the returned probability “as is” (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).

A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam.

However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold).

A value above that threshold indicates “spam”; a value below indicates “not spam.” It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.

Note: “Tuning” a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you’ll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.
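
A minimal sketch of thresholding, using the three example scores from the text. The helper name `apply_threshold` and the choice to treat a score exactly equal to the threshold as positive are assumptions made for illustration.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map each probability to 1 (positive, e.g. spam) or 0 (negative)."""
    # Assumption of this sketch: a score equal to the threshold counts as positive.
    return [1 if p >= threshold else 0 for p in probabilities]


scores = [0.9995, 0.0003, 0.6]  # the example prediction scores from the text

labels_default = apply_threshold(scores, threshold=0.5)
labels_strict = apply_threshold(scores, threshold=0.8)

print(labels_default)  # [1, 0, 1]
print(labels_strict)   # [1, 0, 0]
```

The borderline score of 0.6 changes class when the threshold moves from 0.5 to 0.8, which is why the threshold has to be chosen with the cost of each kind of mistake in mind.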

7.1.2 Confusion Matrix

                   Predicted Positive    Predicted Negative
Actual Positive    TP                    FN
Actual Negative    FP                    TN

True positive (TP): The model predicted positive and the actual class is positive.

True negative (TN): The model predicted negative and the actual class is negative.

False positive (FP, Type I error): The model predicted positive but the actual class is negative.

False negative (FN, Type II error): The model predicted negative but the actual class is positive.
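
The four cells can be counted directly from paired labels. This is a small sketch; the function name `confusion_counts` and the toy labels are illustrative assumptions, with 1 standing for the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, actually positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative (Type I error)
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted negative, actually negative
        else:
            fn += 1  # predicted negative, actually positive (Type II error)
    return tp, fp, tn, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 3 1 3 1
```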

7.1.2.1 False Positive Rate (FPR):

The False Positive Rate is the ratio of false positive predictions to the total number of actual negatives. It measures the rate at which the model incorrectly predicts the positive class among the instances that are actually negative.

\(FPR = \frac{FP}{TN + FP}\)

7.1.2.2 True Positive Rate (TPR), Sensitivity, or Recall:

The True Positive Rate is the ratio of true positive predictions to the total number of actual positives. It measures the ability of the model to correctly predict the positive class among instances that are actually positive.

Recall (TPR) \(= \frac{TP}{TP + FN}\)
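
As a quick worked example of both rates, the sketch below plugs made-up confusion-matrix counts into the two formulas. The counts and function names are assumptions chosen only to make the arithmetic easy to follow.

```python
def true_positive_rate(tp, fn):
    """TPR (recall) = TP / (TP + FN); returns 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN); returns 0.0 if there are no actual negatives."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical counts: 100 actual positives, 900 actual negatives.
tp, fn = 80, 20
fp, tn = 30, 870

tpr = true_positive_rate(tp, fn)   # 80 / 100
fpr = false_positive_rate(fp, tn)  # 30 / 900

print(round(tpr, 4), round(fpr, 4))  # 0.8 0.0333
```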

7.1.2.3 Accuracy:

It represents the ratio of correctly predicted instances to the total number of instances. The accuracy metric is suitable for balanced datasets where the classes are evenly distributed. It is calculated using the following formula:

Accuracy \(= \frac{TP + TN}{TP + TN + FP + FN}\)

Accuracy provides a general sense of how well a model is performing across all classes. It is easy to understand and interpret, making it a commonly used metric, especially when the classes are balanced.

However, accuracy can be misleading when the class distribution is imbalanced. When one class significantly outnumbers the other, a model can achieve high accuracy simply by always predicting the majority class. In such cases, other metrics like precision, recall, F1 score, or the area under the receiver operating characteristic curve (ROC-AUC) are usually more informative.
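
The imbalance problem is easy to reproduce. In this sketch the data set (5 positives out of 100) and the "always predict negative" model are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the actual label."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Share of actual positives that the model found."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

majority_accuracy = accuracy(y_true, y_pred)
majority_recall = recall(y_true, y_pred)

print(majority_accuracy, majority_recall)  # 0.95 0.0
```

The model scores 95% accuracy while finding none of the positive cases.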

7.1.2.4 Precision:

Precision is the ratio of true positive predictions to the total number of positive predictions made by the model. It represents the accuracy of the positive predictions made by the model.

\(Precision = \frac{TP}{TP + FP}\)

F1 Measure:

The F1 score is a metric commonly used in binary classification to provide a balance between precision and recall. It is the harmonic mean of precision and recall, combining both measures into a single value. The F1 score is particularly useful when there is an uneven class distribution or when both false positives and false negatives are important considerations.

The F1 score is useful in situations where achieving a balance between precision and recall is important, as it penalizes models that have a significant imbalance between these two metrics.

\(F1 score = 2 \times \frac{Precision \times Recall}{Precision + Recall}\)
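
A short worked example shows how the harmonic mean pulls F1 toward the weaker of the two metrics. The counts and the function name `precision_recall_f1` are assumptions for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: precise but incomplete predictions.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=80)
arithmetic_mean = (precision + recall) / 2

print(precision, recall)             # 0.8 0.5
print(round(f1, 4))                  # 0.6154
print(round(arithmetic_mean, 4))     # 0.65
```

F1 (about 0.615) sits below the arithmetic mean (0.65), reflecting the penalty for the gap between precision and recall.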

7.1.2.5 In Marketing

In marketing, the choice between optimizing for precision or recall depends on the specific business objectives and the costs associated with false positives and false negatives:

Precision is prioritized when the cost of targeting non-lookalikes is high, and we want to ensure that most of the targeted individuals are genuine lookalikes.

Recall is prioritized when the cost of missing a potential lookalike (lost opportunity) is high, and we want to capture as many true lookalikes as possible, even if it means including some non-lookalikes.

The decision on which metric to prioritize is driven by the campaign’s context and goals.

Precision:

Precision is the ratio of correctly identified positives (true lookalikes) to all instances that were predicted as positives (both true and false lookalikes).

In marketing, precision is valuable when the cost or impact associated with false positives (incorrectly identifying a non-lookalike as a lookalike) is high.

Example: If targeting a non-lookalike with a marketing campaign incurs significant costs (e.g., sending out costly promotions or ads to uninterested users), you want to minimize false positives. High precision ensures that most of the people you target are actual lookalikes, thus reducing wasted marketing spend.

Recall:

Recall is the ratio of correctly identified positives (true lookalikes) to all actual positives (true lookalikes).

In marketing, recall is important when you want to ensure that you are not missing potential opportunities (actual lookalikes).

Example: If missing a true lookalike (a customer who is likely to respond positively to a campaign) results in a high cost or lost opportunity (e.g., missed revenue or engagement), you want to maximize recall. High recall ensures that most of the potential lookalikes are captured by the model, even if some non-lookalikes are incorrectly included.

7.2 ROC Curve

The ROC (Receiver Operating Characteristic) curve is a graphical tool used to evaluate the performance of binary classification models. It helps in understanding how well a model distinguishes between two classes.

The ROC curve helps visualize the trade-offs between true positive rate and false positive rate across different thresholds. By analyzing the ROC curve, considering business costs, and using metrics like Youden’s Index, you can select a probability threshold that balances performance according to your specific needs.

7.2.1 Components of the ROC Curve:

  • True Positive Rate (TPR) / Sensitivity / Recall:

    • Measures the proportion of actual positive cases that are correctly identified by the model.

    • Formula: \[ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

  • False Positive Rate (FPR):

    • Measures the proportion of actual negative cases that are incorrectly classified as positive by the model.

    • Formula: \[ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} \]

7.2.2 How to Read the ROC Curve:

  • X-axis: False Positive Rate (FPR) – The rate at which negative cases are incorrectly classified as positive.

  • Y-axis: True Positive Rate (TPR) – The rate at which positive cases are correctly identified.

  • A perfect classifier would be represented by a point in the upper-left corner of the plot, where TPR is 1 and FPR is 0.

  • A random classifier would produce a diagonal line from the bottom-left to the top-right of the plot, indicating no discriminative power.

7.2.3 Area Under the ROC Curve (AUC):

  • AUC represents the probability that the model ranks a randomly chosen positive case higher than a randomly chosen negative case.

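  • AUC Values (rough rules of thumb; what counts as acceptable depends on the problem):

    • AUC = 0.5: No discriminative power (model performs as well as random guessing).

    • 0.5 < AUC < 0.7: Poor performance.

    • 0.7 ≤ AUC < 0.8: Fair performance.

    • 0.8 ≤ AUC < 0.9: Good performance.

    • AUC ≥ 0.9: Excellent performance.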
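
The ranking interpretation of AUC can be checked by brute force: compare every positive with every negative and count how often the positive scores higher. This is a sketch on toy data; the function name `auc_by_ranking` is an assumption, and ties are counted as half a win, which is the usual convention.

```python
def auc_by_ranking(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = auc_by_ranking(y_true, scores)
print(round(auc, 4))  # 0.8889 (8 of the 9 positive/negative pairs are ranked correctly)
```

The pairwise loop is quadratic in the number of samples, so it is only meant to make the definition concrete, not to be used on large data sets.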

7.2.4 Applications of ROC Curve:

  • Model Evaluation: The ROC curve helps compare different models and choose the one with the best trade-off between true positive rate and false positive rate.

  • Threshold Selection: It aids in selecting the optimal probability threshold for classification, balancing the rate of true positives and false positives.

7.2.5 Using the ROC Curve in Real Examples

The ROC curve is a valuable tool for evaluating and selecting the optimal probability threshold in binary classification problems. Here’s how you can use the ROC curve in practice and select an appropriate threshold:

7.2.5.1 Train Your Model:

  • Fit your binary classification model to the training data.

7.2.5.2 Predict Probabilities:

  • Use the model to predict probabilities of the positive class for the test data. These probabilities are used to assess the performance of the model at various thresholds.

7.2.5.3 Compute True Positive Rate (TPR) and False Positive Rate (FPR):

  • For different threshold values (from 0 to 1), calculate the TPR and FPR. This involves varying the classification threshold and computing the confusion matrix for each threshold.

7.2.5.4 Plot the ROC Curve:

  • Plot the TPR (on the y-axis) against the FPR (on the x-axis) for each threshold value. This gives you the ROC curve.

7.2.5.5 Calculate the Area Under the Curve (AUC):

  • The AUC provides a summary measure of the model’s performance. A higher AUC indicates better model performance.
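
Steps 7.2.5.3 to 7.2.5.5 can be sketched without any plotting library. The code below assumes the predicted probabilities are already available as a list; `roc_points` and `trapezoid_auc` are illustrative names, and the toy data is invented.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr, threshold) tuples, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # A threshold above every score predicts nothing as positive: the (0, 0) corner.
    points = [(0.0, 0.0, float("inf"))]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos, threshold))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

points = roc_points(y_true, scores)
auc = trapezoid_auc(points)

for fpr, tpr, threshold in points:
    print(f"threshold={threshold:<5} FPR={fpr:.2f} TPR={tpr:.2f}")
print(round(auc, 4))  # 0.8889
```

Plotting the `(fpr, tpr)` pairs in order gives the ROC curve; using each distinct score as a threshold covers every point where the curve can change.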

7.2.6 Selecting the Probability Threshold:

Choosing the right probability threshold is crucial for optimizing your model’s performance based on your specific needs. Here’s how to select an appropriate threshold:

7.2.6.1 Visual Inspection:

  • Look at the ROC Curve: Find the point on the ROC curve that is closest to the top-left corner (where TPR is high and FPR is low). This point represents a good trade-off between sensitivity and specificity.

7.2.6.2 Consider the Business Context:

  • Cost-Benefit Analysis: If the cost of false positives is high (e.g., wasted marketing spend), you might prefer a higher threshold, which lowers FPR at the expense of TPR. Conversely, if missing true positives is costly (e.g., lost revenue), you might choose a lower threshold to increase TPR.

  • Decision-Making Criteria: Determine the acceptable levels of TPR and FPR based on business requirements. For example, in a medical diagnosis context, you might prefer higher recall (sensitivity) to ensure no patient with a condition is missed, even if it means higher false positives.
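
One way to make the cost-benefit reasoning concrete is to attach a cost to each error type and pick the threshold with the lowest total cost. This is a sketch only: the per-error costs, candidate thresholds, and function names are hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of the errors made at a given threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(y_true, scores, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.7, 0.4, 0.6, 0.5, 0.2, 0.1]
candidates = [0.3, 0.45, 0.65]

# False positives are expensive (e.g. wasted marketing spend).
fp_heavy = cheapest_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1)
# False negatives are expensive (e.g. lost revenue).
fn_heavy = cheapest_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10)

print(fp_heavy, fn_heavy)  # 0.65 0.3
```

Costly false positives push the choice toward the higher threshold, and costly false negatives push it toward the lower one.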

7.2.6.3 Optimization Metrics:

  • Youden’s Index: Calculate Youden’s Index (\(J\)) which is defined as: \[ J = \text{TPR} - \text{FPR} \] The threshold corresponding to the maximum value of \(J\) can be chosen as it represents the best trade-off between TPR and FPR.
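
A minimal sketch of threshold selection with Youden’s Index, trying each distinct score as a candidate threshold. The function name `youden_threshold` and the toy data are assumptions; if several thresholds tie on \(J\), this version keeps the lowest one.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison: the lowest threshold wins ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.7, 0.4, 0.6, 0.5, 0.2, 0.1]

best_threshold, best_j = youden_threshold(y_true, scores)
print(best_threshold, best_j)  # 0.7 0.75
```

Youden’s Index weights false positives and false negatives equally, so it is a reasonable default only when the two error types cost about the same.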

7.2.6.4 Confusion Matrix Analysis:

  • Evaluate Different Thresholds: For each threshold, compute the confusion matrix and analyze precision, recall, and F1-score. Choose the threshold that best aligns with your performance goals.
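
If F1 is the performance goal, the same sweep can be run on F1 instead of TPR and FPR. This is a sketch on invented data, and the candidate thresholds are arbitrary.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score of the labels obtained by cutting the scores at `threshold`."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.7, 0.4, 0.6, 0.5, 0.2, 0.1]
thresholds = [0.3, 0.45, 0.65]

f1_by_threshold = {t: f1_at_threshold(y_true, scores, t) for t in thresholds}
best_f1_threshold = max(f1_by_threshold, key=f1_by_threshold.get)

for t, f1 in f1_by_threshold.items():
    print(t, round(f1, 3))
print(best_f1_threshold)  # 0.65
```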

7.2.6.5 Cross-Validation:

  • Cross-Validation: Use cross-validation to ensure that the chosen threshold performs well across different subsets of the data. This helps in generalizing the model’s performance and avoiding overfitting.

7.2.7 ROC Curve Example:

Let’s consider an example where you have a model predicting whether an email is spam or not:

  1. Train the Model: You train a logistic regression model to classify emails as spam or not spam.

  2. Predict Probabilities: The model outputs probabilities for each email being spam.

  3. Compute TPR and FPR: Calculate TPR and FPR for various thresholds (e.g., 0.1, 0.2, …, 0.9).

  4. Plot the ROC Curve: Plot TPR against FPR for each threshold value.

  5. Select Threshold:

    • Visual Inspection: Identify the threshold where the ROC curve is closest to the top-left corner.

    • Business Context: If false positives (non-spam emails marked as spam) lead to user dissatisfaction, you might prefer a higher threshold to reduce FPR.

    • Optimization: Calculate Youden’s Index and select the threshold with the highest value.

  6. Implement and Monitor: Set the chosen threshold in your production system and monitor its performance. Adjust as needed based on real-world feedback and performance metrics.

7.3 Overfitting

Overfitting occurs when a model learns the noise and details of the training data too well, resulting in poor generalization to new, unseen data.

7.3.1 How Do You Overcome Overfitting?

To overcome overfitting:

  • Regularization: Apply techniques like L1 (Lasso) and L2 (Ridge) regularization to penalize large coefficients, which helps prevent the model from becoming overly complex.

  • Early Stopping: When training models like XGBoost or neural networks, use early stopping to halt training once the model’s performance on a validation set stops improving.

  • Reduce Model Complexity: For tree-based models, limit the depth of trees, reduce the number of features, or decrease the number of trees (estimators) to simplify the model.

  • Pruning: For decision trees, apply post-pruning or pre-pruning techniques to cut off parts of the tree that provide little to no predictive power.

  • Ensemble Methods with Bagging: Techniques like Random Forest use bagging to reduce variance by averaging multiple decision trees trained on different random subsets of data.
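
Of the techniques above, early stopping is the easiest to show in a few lines. The sketch below works on a precomputed list of validation losses, one per training round; the `patience` parameter and the loss values are illustrative assumptions and do not reproduce any particular library's API.

```python
def early_stopping_round(validation_losses, patience=2):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_loss = float("inf")
    best_round = -1
    rounds_without_improvement = 0
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_index
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_round, best_loss


# Hypothetical validation losses: improvement stalls after round 3.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.40]

best_round, best_loss = early_stopping_round(losses, patience=2)
print(best_round, best_loss)  # 3 0.5
```

Training halts after round 5, so the later value of 0.40 is never seen. That is the trade-off `patience` controls: a larger value tolerates longer plateaus at the cost of more training time.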

7.3.2 Data Stratification Technique

Stratification is a technique used during data splitting to ensure that the training and test sets are representative of the overall distribution of the target variable. This is particularly important when the target variable is imbalanced.

  • Stratified Sampling:

    • When splitting data into training and testing sets, use stratified sampling to maintain the same proportion of each class in both sets as in the overall dataset. This ensures that both the training and test sets are representative of the overall population.

    • Stratification can be done for classification problems where the target variable is categorical, ensuring that minority and majority classes are adequately represented in both sets.

  • K-Fold Stratified Cross-Validation:

    • Instead of regular k-fold cross-validation, use stratified k-fold cross-validation to ensure that each fold has approximately the same percentage of samples of each target class as the complete dataset. This helps in better generalization, especially with imbalanced data.
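
A stratified train/test split can be sketched by shuffling and splitting each class separately. The function name `stratified_split`, the 25% test fraction, and the toy labels are assumptions for illustration; in practice a library routine would normally be used.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test while preserving each class's proportion."""
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 20% positives.
labels = [1] * 20 + [0] * 80

train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
train_positive_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)

print(len(train_idx), len(test_idx))              # 75 25
print(train_positive_share, test_positive_share)  # 0.2 0.2
```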

7.3.3 Any Other Way to Simplify the Model?

Simplifying the model can help prevent overfitting by reducing its capacity to learn overly complex patterns from the data. Some strategies include:

  • Feature Selection:

    • Remove irrelevant or redundant features to reduce model complexity. Techniques like Recursive Feature Elimination (RFE), LASSO regularization, and mutual information can help identify important features.
  • Dimensionality Reduction:

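    • Apply techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the data. This helps in simplifying the model and reducing the risk of overfitting. t-Distributed Stochastic Neighbor Embedding (t-SNE) also reduces dimensionality, but it is mainly used for visualization rather than as a preprocessing step for a model.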
  • Parameter Tuning:

    • For models like Decision Trees and XGBoost, tuning parameters such as max_depth, min_child_weight, gamma, and subsample can help simplify the model by controlling how much it learns from the data.

7.3.4 Are You Using Cross-Validation Method?

Yes, cross-validation is a critical method to evaluate and improve model performance, especially for preventing overfitting:

  • K-Fold Cross-Validation:
    • The dataset is divided into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used as the test set once. The performance metrics are then averaged across all k iterations to get a more reliable estimate of model performance.
    • Common values for k are 5 or 10, but they can be adjusted based on the dataset size.
  • Stratified K-Fold Cross-Validation:
    • As mentioned earlier, it ensures that each fold is representative of the class distribution, making it particularly useful for imbalanced datasets.
  • Leave-One-Out Cross-Validation (LOOCV):
    • This is a special case of k-fold cross-validation where k equals the number of samples in the dataset. It is more computationally expensive but provides a nearly unbiased estimate of the model’s performance.
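
The fold bookkeeping behind k-fold cross-validation is simple enough to sketch directly. The generator below splits indices into contiguous folds without shuffling, which is a simplifying assumption; `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)

# Leave-one-out is the special case k == n_samples.
loocv_folds = list(k_fold_indices(n_samples=4, k=4))
print([test for _, test in loocv_folds])  # [[0], [1], [2], [3]]
```

Each sample appears in exactly one test fold. In each iteration you would fit the model on `train`, score it on `test`, and average the k scores.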

By combining these techniques—regularization, stratification, feature selection, dimensionality reduction, and cross-validation—you can significantly reduce the risk of overfitting and build more robust machine learning models.

7.4 Bias-Variance Tradeoff

The bias-variance tradeoff refers to the balance between two sources of error that affect the performance of a machine learning model: bias and variance.

Understanding this tradeoff is key to building models that generalize well to new data.

In machine learning, the total error (or loss) of a model can be decomposed into three parts:

  1. Bias: Error due to overly simplistic assumptions in the model.

  2. Variance: Error due to excessive sensitivity to small fluctuations in the training data.

  3. Irreducible Error: Error that cannot be reduced regardless of the model. This is typically noise in the data.

The goal of machine learning is to minimize both bias and variance to achieve good generalization on unseen data.

7.4.1 Key Concepts in Bias-Variance Tradeoff

  1. Bias:

    • Bias represents the error due to simplifying assumptions made by the model to make the target function easier to learn.

    • High Bias occurs when a model is too simple, underfitting the training data. For example, using a linear model to fit data that has a nonlinear pattern results in high bias because the model cannot capture the complexity of the data.

    • Characteristics of High Bias Models:

      • Poor performance on training data.

      • Poor performance on validation/test data.

      • Example Models: Linear Regression, Logistic Regression with limited features.

  2. Variance:

    • Variance represents the model’s sensitivity to fluctuations in the training data. A model with high variance pays too much attention to the noise in the training data.

    • High Variance occurs when a model is too complex, overfitting the training data. The model captures the noise in the training data, making it perform well on training data but poorly on new data.

    • Characteristics of High Variance Models:

      • Excellent performance on training data.

      • Poor performance on validation/test data.

      • Example Models: Decision Trees without pruning, High-degree polynomial regression.

  3. Irreducible Error:

    • This is the inherent error in the problem itself, such as random noise in the data that cannot be explained by any model. It represents the lowest possible error that can be achieved.

7.4.2 Error Decomposition and Tradeoff

The expected error of a model can be broken down as follows:

\[ \text{Expected Error} = (\text{Bias})^2 + \text{Variance} + \text{Irreducible Error} \]

  • Low Bias and High Variance: A model like a deep decision tree may have low bias (fits training data well) but high variance (poor generalization to new data).

  • High Bias and Low Variance: A model like linear regression may have high bias (oversimplifies the data) but low variance (less sensitivity to small changes in data).
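
The squared-bias-plus-variance part of the decomposition can be checked numerically. The sketch below estimates a known mean with a deliberately shrunken estimator (0.8 times the sample mean), which is an assumption chosen to create bias. Because it estimates a fixed parameter rather than predicting noisy outcomes, the irreducible error term does not appear.

```python
import random
import statistics


def bias_variance_of_estimates(estimates, true_value):
    """Return (bias, variance, mse) of a collection of estimates of `true_value`."""
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - true_value
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse


rng = random.Random(0)
true_mean = 10.0

# Repeat the experiment on many independent training samples.
estimates = []
for _ in range(2000):
    sample = [rng.gauss(true_mean, 2.0) for _ in range(20)]
    estimates.append(0.8 * statistics.fmean(sample))  # shrunken, hence biased

bias, variance, mse = bias_variance_of_estimates(estimates, true_mean)
print(round(bias, 3), round(variance, 3), round(mse, 3))
print(abs(mse - (bias ** 2 + variance)) < 1e-8)  # True
```

The shrinkage lowers variance (by a factor of 0.64 relative to the plain sample mean) but introduces a bias of roughly −2, so here the total error is dominated by bias.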

7.4.3 Managing the Bias-Variance Tradeoff

To achieve a good balance between bias and variance:

  1. Regularization:
    • Techniques like Ridge (L2) and Lasso (L1) regularization add a penalty term to the model loss function to prevent overfitting, reducing variance at the cost of slightly increasing bias.
  2. Model Complexity:
    • Select an appropriate model complexity that balances bias and variance. For example, in polynomial regression, choose a degree that isn’t too low (high bias) or too high (high variance).
  3. Cross-Validation:
    • Use k-fold cross-validation to evaluate model performance and detect high variance or high bias. This provides a more reliable estimate of the model’s generalization error.
  4. Ensemble Methods:
    • Techniques like Bagging (e.g., Random Forest) reduce variance by averaging predictions from multiple models. Boosting methods like XGBoost focus on reducing bias by sequentially learning from mistakes.
  5. Feature Selection:
    • Simplify the model by removing irrelevant or redundant features to prevent overfitting, reducing variance.

7.4.4 Conclusion

  • High Bias, Low Variance: Simple models that do not learn the complexity of the data well. Risk: Underfitting.

  • Low Bias, High Variance: Complex models that learn the training data too well, including noise. Risk: Overfitting.

The bias-variance tradeoff involves finding the right balance between these two to minimize the total error. The ideal model will have a good balance of bias and variance, leading to the lowest possible error on unseen data.

7.4.5 Lift Chart

A Lift Chart is a visual tool used in predictive modeling, particularly for evaluating the effectiveness of classification models in binary outcomes (e.g., customer purchase vs. non-purchase).

  • Definition:

    • A lift chart shows the improvement (or “lift”) of a model’s predictions compared to a random baseline. It helps to understand how much better the model is at identifying positive outcomes than a random guess.
  • Components:

    • X-axis: Percentage of data points (e.g., customers) sorted by predicted probability of being positive.

    • Y-axis: Cumulative number or percentage of true positives.

  • How to Interpret:

    • A perfect model would capture all positives in the first few data points, resulting in a steep curve.

    • A random model will produce a diagonal line (45-degree), where the cumulative percentage of positives captured equals the percentage of the population targeted.

    • Lift is calculated as the ratio of the cumulative positives identified by the model to the cumulative positives expected from a random model at the same point, so a random model has a lift of 1.

  • Use Case:

    • Lift charts are commonly used in marketing to identify customers most likely to respond to a campaign. A lift of 3, for instance, would mean the model is three times better than random guessing at identifying potential respondents.
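
Lift at a given depth can be computed by ranking customers by score and comparing the positive rate in the top slice with the overall positive rate. This is a sketch on invented data; `lift_at_fraction` is an illustrative name.

```python
def lift_at_fraction(y_true, scores, fraction):
    """Lift of the top `fraction` of cases when ranked by descending score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top  # positive rate in the top slice
    base_rate = sum(y_true) / len(y_true)                 # positive rate under random targeting
    return top_rate / base_rate


# Hypothetical campaign data: 3 responders among 10 customers.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

lift_top_20 = lift_at_fraction(y_true, scores, 0.2)
lift_all = lift_at_fraction(y_true, scores, 1.0)

print(round(lift_top_20, 2))  # 3.33
print(lift_all)               # 1.0
```

Targeting the top 20% finds responders at about 3.3 times the base rate, and the lift falls back to 1 once the whole population is included.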

7.4.6 ROC Curve (Receiver Operating Characteristic Curve)

An ROC Curve is a graphical representation used to evaluate the performance of binary classifiers. It shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) at different threshold settings.

  • Definition:
    • True Positive Rate (TPR) / Sensitivity: The proportion of actual positives correctly identified by the model. \[ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
    • False Positive Rate (FPR): The proportion of actual negatives incorrectly classified as positives. \[ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} \]
  • How to Read the ROC Curve:
    • X-axis: False Positive Rate (FPR).
    • Y-axis: True Positive Rate (TPR).
    • A point in the upper left corner represents a perfect classifier (100% TPR and 0% FPR).
    • A diagonal line from (0,0) to (1,1) represents a random classifier.
  • Area Under the ROC Curve (AUC):
    • The AUC is a single scalar value between 0 and 1 that represents the model’s ability to discriminate between positive and negative classes.
    • AUC = 0.5: The model is no better than random guessing.
    • AUC = 1: The model is perfect.
    • Higher AUC: Better model performance.
  • Use Case:
    • ROC curves and AUC are widely used in fields like medical diagnosis, fraud detection, and any domain where distinguishing between two classes is important.

7.4.7 Summary

  • Mutual Information helps in feature selection by quantifying the dependency between variables.
  • Lift Chart evaluates the effectiveness of classification models, showing the improvement over a random guess.
  • ROC Curve and AUC provide insight into the model’s ability to distinguish between classes, with a focus on sensitivity and specificity.

7.4.8 Bootstrapping

7.4.8.1 Jack-knife

• The jackknife is a tool for estimating the standard error and the bias of an estimator.

• As its name suggests, the jackknife is a small, handy tool, in contrast to the bootstrap, which is the moral equivalent of a giant workshop full of tools.

• Both the jackknife and the bootstrap involve resampling data; that is, repeatedly creating new data sets from the original data.

• The jackknife deletes each observation in turn and calculates an estimate based on the remaining n − 1 observations.

• It uses this collection of estimates to do things like estimate the bias and the standard error.

• Estimating the bias and the standard error this way is not needed for statistics like the sample mean, which we already know is an unbiased estimate of the population mean and whose standard error has a known formula.

• It has been shown that the jackknife is a linear approximation to the bootstrap.

• Generally, do not use the jackknife for sample quantiles like the median, as it has been shown to have some poor properties.
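
A minimal sketch of the leave-one-out procedure, using the standard jackknife formulas for bias and standard error. The data values and the function name `jackknife` are assumptions. The sample mean is used as the estimator because the result can be checked against the known formula s/√n.

```python
import math
import statistics


def jackknife(data, estimator):
    """Return (bias_estimate, standard_error) of `estimator` via the jackknife."""
    n = len(data)
    full_estimate = estimator(data)
    # Recompute the estimate n times, each time leaving one observation out.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = statistics.fmean(leave_one_out)
    bias = (n - 1) * (mean_loo - full_estimate)
    variance = (n - 1) / n * sum((e - mean_loo) ** 2 for e in leave_one_out)
    return bias, math.sqrt(variance)


data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]

bias, std_error = jackknife(data, statistics.fmean)
known_std_error = statistics.stdev(data) / math.sqrt(len(data))

print(abs(bias) < 1e-9)                         # True: the sample mean is unbiased
print(abs(std_error - known_std_error) < 1e-9)  # True: matches s / sqrt(n)
```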

The bootstrap

• The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics.

• For example, how would one derive a confidence interval for the median?

• The bootstrap procedure follows from the so-called bootstrap principle.

• Suppose that I have a statistic that estimates some population parameter, but I don’t know its sampling distribution.

• The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution.

• In practice, the bootstrap principle is always carried out using simulation.

• The general procedure follows by first simulating complete data sets from the observed data with replacement.

• This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution.

• Calculate the statistic for each simulated data set.

• Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error.

Example

• Consider again the data set of 630 measurements of gray matter volume for workers from a lead manufacturing plant.

• The median gray matter volume is around 589 cubic centimeters.

• We want a confidence interval for the median of these measurements.

• Bootstrap procedure for calculating a confidence interval for the median from a data set of n observations:

  i. Sample n observations with replacement from the observed data, resulting in one simulated complete data set.

  ii. Take the median of the simulated data set.

  iii. Repeat these two steps B times, resulting in B simulated medians.

  iv. These medians are approximately draws from the sampling distribution of the median of n observations; therefore we can:

    • Draw a histogram of them.

    • Calculate their standard deviation to estimate the standard error of the median.

    • Take the 2.5th and 97.5th percentiles as a confidence interval for the median.
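
The procedure above can be sketched in a few lines. The gray matter measurements are not reproduced here, so the code draws synthetic data centered near 589 as a stand-in. That data, the number of resamples, and the function name `bootstrap_median_ci` are all assumptions for illustration.

```python
import random
import statistics


def bootstrap_median_ci(data, n_resamples=2000, seed=0):
    """Percentile bootstrap: return (lower, upper, std_error) for the median."""
    rng = random.Random(seed)
    n = len(data)
    medians = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # sample n observations with replacement
        medians.append(statistics.median(resample))
    # quantiles(..., n=40) returns cut points at 2.5%, 5%, ..., 97.5%.
    cut_points = statistics.quantiles(medians, n=40)
    lower, upper = cut_points[0], cut_points[-1]
    std_error = statistics.stdev(medians)
    return lower, upper, std_error


# Synthetic stand-in data (NOT the gray matter measurements from the text).
data_rng = random.Random(1)
data = [data_rng.gauss(589, 30) for _ in range(200)]

lower, upper, std_error = bootstrap_median_ci(data)
sample_median = statistics.median(data)

print(round(sample_median, 1))
print(round(lower, 1), round(upper, 1), round(std_error, 2))
```

This is the plain percentile interval; bias-corrected variants exist and are generally preferred when the bootstrap distribution is skewed.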

Summary

• The bootstrap is non-parametric.

• However, the theoretical arguments proving the validity of the bootstrap rely on large samples.

• Better percentile bootstrap confidence intervals correct for bias.

• There are lots of variations on bootstrap procedures; the book “An Introduction to the Bootstrap” by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information.