Performance Gap Reference Range
Is the performance gap within the range of expected performance gaps of standard untuned machine learning models?
Procedure
- Prepare Data and Models
  - Split the dataset into train and test sets using a fixed ratio (e.g., 80/20).
  - Select at least 10 untuned standard machine learning models (e.g., logistic regression, decision trees, SVM, k-NN).
  - Ensure the models cover a diverse range of algorithms and assumptions (see the sketch below).
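A minimal sketch of this preparation step, assuming a classification task with features X and labels y (hypothetical names); the helper prepare_data_and_models and its particular model suite are illustrative, not prescriptive.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier

def prepare_data_and_models(X, y, test_size=0.2, seed=42):
    # Fixed-ratio split: the test set is held out until the final gap calculation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # Diverse, untuned (default-hyperparameter) models with different inductive biases.
    models = [
        LogisticRegression(max_iter=1000), LinearDiscriminantAnalysis(), GaussianNB(),
        DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(),
        SVC(), KNeighborsClassifier(), DummyClassifier(strategy="most_frequent"),
    ]
    return (X_train, y_train), (X_test, y_test), models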
- Train and Evaluate Models on the Train Set
  - Train each of the selected models on the train set.
  - Evaluate their performance on the test harness (e.g., through k-fold cross-validation).
  - Record the cross-validated performance metrics for each model (sketched below).
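A minimal sketch of the test-harness evaluation, assuming the models list and training split produced in the previous step; evaluate_on_harness is a hypothetical helper and accuracy is only an example scoring choice.

import numpy as np
from sklearn.model_selection import cross_val_score

def evaluate_on_harness(models, X_train, y_train, k=10, scoring="accuracy"):
    # k-fold cross-validation on the training data only; the hold-out test set is not touched.
    harness_scores = {}
    for model in models:
        scores = cross_val_score(model, X_train, y_train, cv=k, scoring=scoring)
        harness_scores[type(model).__name__] = float(np.mean(scores))
    return harness_scores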
- Evaluate Models on the Hold-Out Test Set
  - Evaluate the same trained models on the hold-out test set.
  - Record their performance metrics on this unseen data (sketched below).
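A minimal sketch of the hold-out evaluation under the same assumptions (classification, accuracy as the metric); evaluate_on_holdout is a hypothetical helper.

from sklearn.metrics import accuracy_score

def evaluate_on_holdout(models, X_train, y_train, X_test, y_test):
    # Fit each model on the full training set, then score it once on the unseen test set.
    holdout_scores = {}
    for model in models:
        model.fit(X_train, y_train)
        holdout_scores[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))
    return holdout_scores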
- Calculate Generalization Gaps
  - For each model, compute the generalization gap as the absolute difference between the test harness performance and the test set performance.
  - Aggregate these gaps to create a reference distribution (see the sketch below).
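A minimal sketch of the gap calculation, assuming the two score dictionaries returned by the hypothetical helpers above use the same model names as keys.

import numpy as np

def gap_reference_distribution(harness_scores, holdout_scores):
    # Gap = |cross-validated score - hold-out score| for each standard model.
    gaps = np.array([abs(harness_scores[name] - holdout_scores[name]) for name in harness_scores])
    return {
        "gaps": gaps,
        "min": float(gaps.min()),
        "max": float(gaps.max()),
        "mean": float(gaps.mean()),
        "median": float(np.median(gaps)),
    }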
- Assess the Chosen Model’s Generalization Gap
  - Calculate the generalization gap for the chosen model.
  - Compare this gap to the reference distribution of standard model gaps (sketched below).
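A minimal sketch of the comparison, assuming chosen_gap is the chosen model's gap computed the same way and reference_gaps is the array from the previous step; both names are illustrative.

import numpy as np

def assess_chosen_gap(chosen_gap, reference_gaps):
    # Where does the chosen model's gap sit relative to the standard-model gaps?
    reference_gaps = np.asarray(reference_gaps)
    within_range = reference_gaps.min() <= chosen_gap <= reference_gaps.max()
    # Fraction of reference gaps smaller than the chosen gap (a rough percentile).
    percentile = float(np.mean(reference_gaps < chosen_gap))
    return {"within_min_max_range": bool(within_range), "approx_percentile": percentile}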
- Evaluate Confidence Interval and Standard Error
  - Determine the confidence interval and standard error of the reference distribution, as sketched below.
  - Check if the chosen model’s gap falls within:
    - The confidence interval of the reference distribution.
    - The standard error of the reference distribution.
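A minimal sketch of these interval checks, assuming a Gaussian approximation for the mean reference gap (mirroring the full example later in this section); interval_checks is a hypothetical helper.

import numpy as np
from scipy.stats import norm

def interval_checks(chosen_gap, reference_gaps, confidence=0.95):
    gaps = np.asarray(reference_gaps, dtype=float)
    mean, se = gaps.mean(), gaps.std(ddof=1) / np.sqrt(len(gaps))
    # Gaussian-approximation confidence interval around the mean reference gap.
    ci_low, ci_high = norm.interval(confidence, loc=mean, scale=se)
    return {
        "within_95_ci": bool(ci_low <= chosen_gap <= ci_high),
        "within_one_standard_error": bool(abs(chosen_gap - mean) <= se),
        "ci": (float(ci_low), float(ci_high)),
        "standard_error": float(se),
    }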
- Summarize Results and Insights
  - Report whether the chosen model’s generalization gap is consistent with the standard model gaps.
  - Highlight any deviations that warrant further investigation.
  - Provide recommendations if the gap suggests potential overfitting or underfitting.
Limitations
- Diversity of Models
  - Results depend on the diversity of standard models chosen; using similar models may reduce the utility of the reference distribution.
- Dataset Size
  - Small datasets may lead to unstable generalization gap estimates due to high variance in splits.
- Untuned Models
  - Using untuned models may not fully represent the performance range of well-optimized models, potentially skewing the reference distribution.
- Test Set Leakage
  - Repeated evaluations on the same test set can inadvertently cause overfitting to the test set performance.
- Context Sensitivity
  - The diagnostic is specific to the chosen dataset and may not generalize across datasets or tasks without adjustments.
Code Example
Below is a Python function that calculates the reference distribution of the generalization gap across a suite of untuned standard machine learning algorithms, reporting the min/max gaps, distribution statistics, and a 95% confidence interval.
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              RandomForestRegressor, GradientBoostingRegressor)
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor
def generalization_gap_analysis(train, test, problem_type, metric, k=10):
    """
    Calculate the reference distribution of generalization gaps across standard untuned ML algorithms.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        problem_type (str): "classification" or "regression".
        metric: Performance metric function compatible with scikit-learn (higher values assumed better).
        k (int): Number of folds for cross-validation.

    Returns:
        dict: Dictionary containing min/max gaps, mean/median, and confidence interval for generalization gaps.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Define standard untuned models based on the problem type
    if problem_type == "classification":
        models = [
            LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier(),
            GradientBoostingClassifier(), SVC(), KNeighborsClassifier(),
            DummyClassifier(strategy="most_frequent")
        ]
    elif problem_type == "regression":
        models = [
            LinearRegression(), Ridge(), RandomForestRegressor(),
            GradientBoostingRegressor(), SVR(), KNeighborsRegressor(),
            DummyRegressor(strategy="mean")
        ]
    else:
        raise ValueError("Invalid problem type. Choose 'classification' or 'regression'.")

    # Calculate gaps for each model
    gaps = []
    for model in models:
        # Cross-validation performance on the training data (the test harness).
        # Note: make_scorer assumes higher metric values are better (e.g., accuracy, R^2).
        cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
        model.fit(X_train, y_train)
        # Hold-out test set performance
        test_score = metric(y_test, model.predict(X_test))
        # Generalization gap
        gap = np.abs(np.mean(cv_scores) - test_score)
        gaps.append(gap)

    # Calculate distribution statistics
    mean_gap = np.mean(gaps)
    median_gap = np.median(gaps)
    std_gap = np.std(gaps)
    ci_low, ci_high = norm.interval(0.95, loc=mean_gap, scale=std_gap / np.sqrt(len(gaps)))

    return {
        "Min Gap": np.min(gaps),
        "Max Gap": np.max(gaps),
        "Mean Gap": mean_gap,
        "Median Gap": median_gap,
        "95% CI": (ci_low, ci_high)
    }
# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define problem type and metric
    problem_type = "classification"
    metric = accuracy_score

    # Run the generalization gap analysis
    results = generalization_gap_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        problem_type=problem_type,
        metric=metric,
        k=10
    )

    # Print Results
    print("Generalization Gap Analysis Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")
Example Output
Generalization Gap Analysis Results:
- Min Gap: 0.01750000000000007
- Max Gap: 0.08750000000000008
- Mean Gap: 0.03732142857142861
- Median Gap: 0.03249999999999997
- 95% CI: (np.float64(0.02111231557636773), np.float64(0.05353054156648949))