Calculate Generalization Gap
What is the generalization gap for the chosen model on the chosen train and test sets?
Procedure
Train and Evaluate Models Using the Test Harness
- Train the selected model(s) using a chosen test harness, such as k-fold cross-validation.
- Record the cross-validated performance metrics (e.g., accuracy, precision, recall, F1-score) produced by the test harness, as in the sketch below.
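A minimal sketch of this step, assuming the same synthetic dataset and RandomForestClassifier used in the full example later in this section; scikit-learn's cross_validate records several metrics in one pass:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic data and train/test split (illustrative choices).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Run the test harness: 5-fold cross-validation, recording four metrics.
metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(
    RandomForestClassifier(random_state=42),
    X_train, y_train, cv=5, scoring=metrics,
)
harness_scores = {m: np.mean(cv_results[f"test_{m}"]) for m in metrics}
print(harness_scores)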
Evaluate the Model on the Test Set
- Refit the model on the entire training set, then evaluate it on the test set (data never seen during training or model selection).
- Record performance on the test set using the same metrics as the test harness; see the sketch below.
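Continuing the sketch, the model is refit on the entire training set and scored once on the held-out test set with the same four metrics (the dataset and model choices remain illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same synthetic split as in the previous sketch.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Refit on the full training set, then score once on the untouched test set.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(test_scores)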
Calculate the Generalization Gap
- Compute the absolute difference between the performance metric(s) from the test harness and the test set.
- For example, if accuracy is 90% in the test harness and 85% on the test set, the generalization gap is 5 percentage points; the arithmetic is sketched below.
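The arithmetic is a one-liner. The sketch below assumes harness_scores and test_scores dictionaries like the ones built above (hypothetical names, not a library API) and reproduces the 90% vs. 85% worked example:

# Hypothetical per-metric scores standing in for real harness/test results.
harness_scores = {"accuracy": 0.90}
test_scores = {"accuracy": 0.85}

# Absolute gap per metric, rounded to avoid floating-point noise.
gaps = {m: round(abs(harness_scores[m] - test_scores[m]), 4) for m in harness_scores}
print(gaps)  # {'accuracy': 0.05}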
Report Results and Recommendations
- Summarize the findings, highlighting the generalization gap for each metric and its implications (e.g., a large gap suggests overfitting); a minimal reporting sketch follows.
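A report can be as simple as one line per metric with a flag when the gap exceeds a tolerance; the 0.05 threshold and the placeholder scores below are arbitrary illustrations:

TOLERANCE = 0.05  # arbitrary example threshold; set per application

harness_scores = {"accuracy": 0.91, "f1": 0.89}  # placeholder values
test_scores = {"accuracy": 0.88, "f1": 0.82}     # placeholder values

for metric in harness_scores:
    gap = abs(harness_scores[metric] - test_scores[metric])
    status = "review for overfitting" if gap > TOLERANCE else "acceptable"
    print(f"{metric}: harness={harness_scores[metric]:.4f} "
          f"test={test_scores[metric]:.4f} gap={gap:.4f} ({status})")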
Limitations
Dependency on Metric Choice
- The interpretation of the generalization gap depends on the chosen metric; some metrics are more sensitive to overfitting than others, as the sketch below illustrates.
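One way to see this is to compute the gap under two different metrics on an imbalanced problem, where accuracy and F1 often disagree; the 90/10 class split below is an illustrative assumption:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Imbalanced synthetic problem: roughly 90% negative, 10% positive.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0)
for name, scorer in (("accuracy", accuracy_score), ("f1", f1_score)):
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=5, scoring=name).mean()
    model.fit(X_tr, y_tr)
    test = scorer(y_te, model.predict(X_te))
    print(f"{name}: gap = {abs(cv_mean - test):.4f}")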
Dataset Size
- Small datasets produce high-variance performance estimates, making a single gap calculation unreliable; see the sketch below.
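The sketch below makes this visible by repeating cross-validation with different fold assignments on a deliberately small dataset; a standard deviation that is large relative to any plausible gap means a single gap estimate is unreliable:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Deliberately small dataset (100 samples) to expose estimate variance.
X_small, y_small = make_classification(n_samples=100, n_features=10, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X_small, y_small, cv=cv)
print(f"CV accuracy: mean={scores.mean():.4f}, std={scores.std():.4f}")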
Test Set Representation
- If the test set does not accurately represent the real-world data distribution, the generalization gap may be misleading; a rough distribution check is sketched below.
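A rough sanity check, assuming SciPy is available, is a two-sample Kolmogorov-Smirnov test on each feature's train/test marginals; this heuristic only detects marginal shifts and proves nothing about representativeness, and on a random split like the one below no feature should normally be flagged:

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Flag features whose train/test marginal distributions differ noticeably.
for j in range(X_train.shape[1]):
    result = ks_2samp(X_train[:, j], X_test[:, j])
    if result.pvalue < 0.05:  # conventional (and arbitrary) threshold
        print(f"feature {j}: possible train/test shift (KS p={result.pvalue:.3f})")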
Complexity of the Model
- Highly complex models are more likely to exhibit large generalization gaps, especially when trained on limited data; the sketch below compares a simple and a complex model.
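To make this concrete, the sketch below compares the gap for a depth-limited decision tree against an unconstrained one on the same split; the specific depths are arbitrary illustrations:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Shallow (simple) vs. unconstrained (complex) trees on the same split.
for depth in (2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_mean = cross_val_score(tree, X_train, y_train, cv=5, scoring="accuracy").mean()
    tree.fit(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: gap = {abs(cv_mean - test_acc):.4f}")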
Use Case Specificity
- Acceptable thresholds for the generalization gap vary depending on the application domain and the risk tolerance for errors.
Potential Over-Optimization
- Aggressive efforts to minimize the generalization gap can lead to underfitting, where the model trades absolute performance for a smaller gap.
Code Example
Below is a Python function that calculates the generalization gap from the mean k-fold cross-validation score (the test harness) and the test set score, followed by a complete demo.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split


def calculate_generalization_gap(cv_score, test_score):
    """
    Calculate the generalization gap as the absolute difference between
    the test harness score and the test set score.

    Parameters:
        cv_score (float): The average score from cross-validation.
        test_score (float): The score on the test set.

    Returns:
        float: The generalization gap.
    """
    return np.abs(cv_score - test_score)


# Demo Usage
if __name__ == "__main__":
    # Generate a synthetic classification dataset
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Choose a model and metric
    model = RandomForestClassifier(random_state=42)
    metric = accuracy_score

    # Perform k-fold cross-validation (the test harness)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    mean_cv_score = np.mean(cv_scores)

    # Train the model on the full training set and evaluate on the test set
    model.fit(X_train, y_train)
    test_predictions = model.predict(X_test)
    test_score = metric(y_test, test_predictions)

    # Calculate the generalization gap
    generalization_gap = calculate_generalization_gap(mean_cv_score, test_score)

    # Output results
    print("Generalization Gap Analysis:")
    print(f" - Cross-Validation Score (mean): {mean_cv_score:.4f}")
    print(f" - Test Set Score: {test_score:.4f}")
    print(f" - Generalization Gap: {generalization_gap:.4f}")
Example Output
Generalization Gap Analysis:
- Cross-Validation Score (mean): 0.9088
- Test Set Score: 0.8800
- Generalization Gap: 0.0288