Test Set Performance Bootstrap
Is the difference between the train set (test harness) and test set performance scores caused by the specific details of the test set?
Procedure
- Evaluate Performance on the Train Set (Test Harness)
  - Assess the performance of the model using cross-validation or another train-set evaluation technique.
  - Record key metrics (e.g., accuracy, precision, recall, F1-score, or RMSE) as the train set performance.
- Evaluate Performance on the Hold-Out Test Set
  - Measure the performance of the final chosen model on the test set.
  - Compute the same metrics used for the train set evaluation.
- Bootstrap Test Set Performance
  - Draw 30+ bootstrap samples from the test set and evaluate the model on each.
  - Record the performance metric for each bootstrap iteration to build a distribution of test set performance.
- Quantify Performance Gap
  - Calculate the absolute difference between the average performance on the train set (test harness) and the average of the test set bootstrap distribution.
  - Compare the train set metric to the bootstrap distribution, e.g., check whether it falls within a confidence interval or outside the min/max bounds (see the sketch after this list).
- Interpret Results
  - Determine whether the train set (test harness) performance lies within the bounds of the test set performance distribution.
  - Note any significant deviations, such as overperformance or underperformance on the train set relative to the test set.
- Summarize Insights and Recommendations
  - Report the quantified performance gap and its implications for model reliability.
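The comparison in the last steps can be done with a simple percentile check. Below is a minimal sketch, assuming cv_scores (k-fold scores from the test harness) and bootstrap_scores (per-resample test set scores) have already been collected in the steps above; the 2.5/97.5 percentile bounds and the observed min/max bounds are both reasonable choices.
import numpy as np

# Hypothetical inputs: scores gathered in the earlier steps.
cv_scores = np.array([0.84, 0.86, 0.85, 0.87, 0.85])         # train set (test harness) scores
bootstrap_scores = np.array([0.83, 0.86, 0.85, 0.84, 0.87])  # test set bootstrap scores (normally 30+)

train_mean = cv_scores.mean()
test_mean = bootstrap_scores.mean()

# Absolute gap between the two average performance estimates.
gap = abs(train_mean - test_mean)

# Compare the train estimate to the bootstrap distribution using a
# 95% percentile interval and the observed min/max bounds.
ci_low, ci_high = np.percentile(bootstrap_scores, [2.5, 97.5])
within_ci = ci_low <= train_mean <= ci_high
within_minmax = bootstrap_scores.min() <= train_mean <= bootstrap_scores.max()

print(f"Gap: {gap:.4f}")
print(f"95% percentile interval: ({ci_low:.4f}, {ci_high:.4f}), within: {within_ci}")
print(f"Min/max bounds: ({bootstrap_scores.min():.4f}, {bootstrap_scores.max():.4f}), within: {within_minmax}")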
Limitations
- Bootstrap Sampling Assumptions
  - Results rely on the representativeness of the bootstrap samples; unbalanced or small test sets can skew the bootstrap distribution.
- Sample Size Dependence
  - Small test sets or too few bootstrap samples may produce unreliable estimates of the performance distribution (see the stability check after this list).
- Overfitting Risks
  - If the modeling pipeline has been excessively optimized, the performance gap may appear misleadingly small due to unintentional overfitting to specific test set characteristics.
- Metric Variability
  - The choice of evaluation metric can influence the perceived performance gap. Use metrics aligned with the business or model objectives.
- Context Sensitivity
  - This diagnostic is specific to the given dataset and task. Results may not generalize to other datasets or models without re-evaluation.
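To probe the sample size dependence noted above, one rough check is to re-run the test set bootstrap with increasing numbers of resamples and watch whether the summary statistics stabilize. A minimal sketch, reusing the synthetic setup from the Code Example below purely for illustration:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# Illustrative setup: small synthetic problem and a fitted model.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = LogisticRegression().fit(X_train, y_train)

# Repeat the bootstrap with more and more resamples; large swings in the
# mean, std, or interval suggest the estimates are not yet reliable.
for n_bootstrap in (30, 100, 300, 1000):
    scores = []
    for _ in range(n_bootstrap):
        X_b, y_b = resample(X_test, y_test, replace=True)
        scores.append(accuracy_score(y_b, model.predict(X_b)))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    print(f"n_bootstrap={n_bootstrap:4d}  mean={np.mean(scores):.4f}  "
          f"std={np.std(scores):.4f}  95% interval=({lo:.4f}, {hi:.4f})")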
Code Example
This function evaluates a machine learning model’s performance using k-fold cross-validation on the training set and bootstrap sampling on the test set to establish and compare performance distributions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import norm
from sklearn.utils import resample
def diagnostic_test_analysis(train, test, model, metric, k=10, n_bootstrap=100):
"""
Perform a diagnostic test to evaluate the generalization gap of a model.
Parameters:
train (tuple): Tuple containing (X_train, y_train).
test (tuple): Tuple containing (X_test, y_test).
model: The chosen machine learning model to evaluate.
metric: Performance metric function compatible with scikit-learn.
k (int): Number of folds for cross-validation.
n_bootstrap (int): Number of bootstrap samples for test set evaluation.
Returns:
dict: Dictionary containing results including performance gaps, test set distribution statistics,
and whether the test harness score falls within the test set performance distribution.
"""
X_train, y_train = train
X_test, y_test = test
# Step 1: Evaluate on training set (test harness) using k-fold cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
train_performance = np.mean(cv_scores)
# Step 2: Fit model on entire training set
model.fit(X_train, y_train)
# Step 3: Bootstrap evaluation on test set
bootstrap_scores = []
for _ in range(n_bootstrap):
X_bootstrap, y_bootstrap = resample(X_test, y_test, replace=True)
bootstrap_score = metric(y_bootstrap, model.predict(X_bootstrap))
bootstrap_scores.append(bootstrap_score)
# Calculate distribution statistics for bootstrap test set performance
test_mean = np.mean(bootstrap_scores)
test_std = np.std(bootstrap_scores)
ci_low, ci_high = norm.interval(0.95, loc=test_mean, scale=test_std / np.sqrt(n_bootstrap))
# Step 4: Compare train performance to test set distribution
within_distribution = ci_low <= train_performance <= ci_high
return {
"Train Performance (Test Harness)": train_performance,
"Test Set Mean": test_mean,
"Test Set Std Dev": test_std,
"95% CI": (ci_low, ci_high),
"Within Distribution": within_distribution,
"Bootstrap Scores": bootstrap_scores
}
# Demo Usage
if __name__ == "__main__":
# Generate synthetic dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
# Define model and metric
model = LogisticRegression()
metric = accuracy_score
# Run the diagnostic test analysis
results = diagnostic_test_analysis(
train=(X_train, y_train),
test=(X_test, y_test),
model=model,
metric=metric,
k=10,
n_bootstrap=100
)
# Print Results
print("Diagnostic Test Analysis Results:")
for key, value in results.items():
if key != "Bootstrap Scores":
print(f" - {key}: {value}")
Example Output
Diagnostic Test Analysis Results:
- Train Performance (Test Harness): 0.8525
- Test Set Mean: 0.8504
- Test Set Std Dev: 0.0123
- 95% CI: (0.8472, 0.8536)
- Within Distribution: True
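For a visual check, the bootstrap distribution can be plotted with the test harness score overlaid. A minimal sketch, assuming matplotlib is installed and results is the dictionary returned by diagnostic_test_analysis above:
import matplotlib.pyplot as plt

# Histogram of the test set bootstrap scores with the train (test harness)
# score and the reported 95% CI overlaid for comparison.
plt.hist(results["Bootstrap Scores"], bins=20, alpha=0.7, label="Bootstrap test scores")
plt.axvline(results["Train Performance (Test Harness)"], color="red",
            linestyle="--", label="Train (test harness) score")
plt.axvline(results["95% CI"][0], color="gray", linestyle=":", label="95% CI")
plt.axvline(results["95% CI"][1], color="gray", linestyle=":")
plt.xlabel("Accuracy (test set bootstrap)")
plt.ylabel("Frequency")
plt.legend()
plt.title("Test set bootstrap distribution vs. test harness performance")
plt.show()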