Calculate Generalization Gap
What is the generalization gap for the chosen model on the chosen train and test sets?
Procedure
Train and Evaluate Models Using the Test Harness
- Train the selected model(s) using a chosen test harness, such as k-fold cross-validation.
- Record the cross-validated performance metrics (e.g., accuracy, precision, recall, F1-score) produced by the test harness, as in the sketch below.
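A minimal sketch of this step, assuming the same synthetic dataset and RandomForestClassifier used in the full example later in this section; scikit-learn's cross_validate records several metrics in one pass:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic data and train/test split (illustrative choices).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Run the test harness: 5-fold cross-validation, recording four metrics.
metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(
    RandomForestClassifier(random_state=42),
    X_train, y_train, cv=5, scoring=metrics,
)
harness_scores = {m: np.mean(cv_results[f"test_{m}"]) for m in metrics}
print(harness_scores)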
Evaluate the Model on the Test Set
- Refit the model on the entire training set, then evaluate it on the test set (data never seen during training or model selection).
- Record performance on the test set using the same metrics as the test harness; see the sketch below.
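Continuing the sketch, the model is refit on the entire training set and scored once on the held-out test set with the same four metrics (the dataset and model choices remain illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same synthetic split as in the previous sketch.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Refit on the full training set, then score once on the untouched test set.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(test_scores)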
Calculate the Generalization Gap
- Compute the absolute difference between the performance metric(s) from the test harness and the test set.
- For example, if accuracy is 90% in the test harness and 85% on the test set, the generalization gap is 5 percentage points; the arithmetic is sketched below.
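The arithmetic is a one-liner. The sketch below assumes harness_scores and test_scores dictionaries like the ones built above (hypothetical names, not a library API) and reproduces the 90% vs. 85% worked example:

# Hypothetical per-metric scores standing in for real harness/test results.
harness_scores = {"accuracy": 0.90}
test_scores = {"accuracy": 0.85}

# Absolute gap per metric, rounded to avoid floating-point noise.
gaps = {m: round(abs(harness_scores[m] - test_scores[m]), 4) for m in harness_scores}
print(gaps)  # {'accuracy': 0.05}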
Report Results and Recommendations
- Summarize the findings, highlighting the generalization gap for each metric and its implications (e.g., a large gap suggests overfitting); a minimal reporting sketch follows.
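A report can be as simple as one line per metric with a flag when the gap exceeds a tolerance; the 0.05 threshold and the placeholder scores below are arbitrary illustrations:

TOLERANCE = 0.05  # arbitrary example threshold; set per application

harness_scores = {"accuracy": 0.91, "f1": 0.89}  # placeholder values
test_scores = {"accuracy": 0.88, "f1": 0.82}     # placeholder values

for metric in harness_scores:
    gap = abs(harness_scores[metric] - test_scores[metric])
    status = "review for overfitting" if gap > TOLERANCE else "acceptable"
    print(f"{metric}: harness={harness_scores[metric]:.4f} "
          f"test={test_scores[metric]:.4f} gap={gap:.4f} ({status})")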
Limitations
Dependency on Metric Choice
- The interpretation of the generalization gap depends on the chosen metric; some metrics are more sensitive to overfitting than others, as the sketch below illustrates.
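One way to see this is to compute the gap under two different metrics on an imbalanced problem, where accuracy and F1 often disagree; the 90/10 class split below is an illustrative assumption:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Imbalanced synthetic problem: roughly 90% negative, 10% positive.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0)
for name, scorer in (("accuracy", accuracy_score), ("f1", f1_score)):
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=5, scoring=name).mean()
    model.fit(X_tr, y_tr)
    test = scorer(y_te, model.predict(X_te))
    print(f"{name}: gap = {abs(cv_mean - test):.4f}")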
Dataset Size
- Small datasets produce high-variance performance estimates, making a single gap calculation unreliable; see the sketch below.
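The sketch below makes this visible by repeating cross-validation with different fold assignments on a deliberately small dataset; a standard deviation that is large relative to any plausible gap means a single gap estimate is unreliable:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Deliberately small dataset (100 samples) to expose estimate variance.
X_small, y_small = make_classification(n_samples=100, n_features=10, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X_small, y_small, cv=cv)
print(f"CV accuracy: mean={scores.mean():.4f}, std={scores.std():.4f}")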
Test Set Representation
- If the test set does not accurately represent the real-world data distribution, the generalization gap may be misleading; a rough distribution check is sketched below.
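A rough sanity check, assuming SciPy is available, is a two-sample Kolmogorov-Smirnov test on each feature's train/test marginals; this heuristic only detects marginal shifts and proves nothing about representativeness, and on a random split like the one below no feature should normally be flagged:

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Flag features whose train/test marginal distributions differ noticeably.
for j in range(X_train.shape[1]):
    result = ks_2samp(X_train[:, j], X_test[:, j])
    if result.pvalue < 0.05:  # conventional (and arbitrary) threshold
        print(f"feature {j}: possible train/test shift (KS p={result.pvalue:.3f})")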
Complexity of the Model
- Highly complex models are more likely to exhibit large generalization gaps, especially when trained on limited data; the sketch below compares a simple and a complex model.
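To make this concrete, the sketch below compares the gap for a depth-limited decision tree against an unconstrained one on the same split; the specific depths are arbitrary illustrations:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Shallow (simple) vs. unconstrained (complex) trees on the same split.
for depth in (2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_mean = cross_val_score(tree, X_train, y_train, cv=5, scoring="accuracy").mean()
    tree.fit(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: gap = {abs(cv_mean - test_acc):.4f}")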
Use Case Specificity
- Acceptable thresholds for the generalization gap vary depending on the application domain and the risk tolerance for errors.
Potential Over-Optimization
- Aggressive efforts to minimize the generalization gap can lead to underfitting, where the model trades absolute performance for a smaller gap.
Code Example
Below is a Python function that calculates the generalization gap from the mean k-fold cross-validation score (the test harness) and the test set score, followed by a complete demo.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split


def calculate_generalization_gap(cv_score, test_score):
    """
    Calculate the generalization gap as the absolute difference between
    the test harness score and the test set score.

    Parameters:
        cv_score (float): The average score from cross-validation.
        test_score (float): The score on the test set.

    Returns:
        float: The generalization gap.
    """
    return np.abs(cv_score - test_score)


# Demo Usage
if __name__ == "__main__":
    # Generate a synthetic classification dataset
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Choose a model and metric
    model = RandomForestClassifier(random_state=42)
    metric = accuracy_score

    # Perform k-fold cross-validation (the test harness)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    mean_cv_score = np.mean(cv_scores)

    # Train the model on the full training set and evaluate on the test set
    model.fit(X_train, y_train)
    test_predictions = model.predict(X_test)
    test_score = metric(y_test, test_predictions)

    # Calculate the generalization gap
    generalization_gap = calculate_generalization_gap(mean_cv_score, test_score)

    # Output results
    print("Generalization Gap Analysis:")
    print(f" - Cross-Validation Score (mean): {mean_cv_score:.4f}")
    print(f" - Test Set Score: {test_score:.4f}")
    print(f" - Generalization Gap: {generalization_gap:.4f}")
Example Output
Generalization Gap Analysis:
- Cross-Validation Score (mean): 0.9088
- Test Set Score: 0.8800
- Generalization Gap: 0.0288