Test Harness Performance Distribution
Is the difference in performance caused by the variance of the test harness?
Procedure
- Train the Chosen Model on the Test Harness
  - Use the chosen model and train it multiple times on the training set using a sampling technique such as k-fold cross-validation, bootstrapping, or repeated train/test splits.
  - Record the model’s performance for each training iteration to build a distribution of expected performance metrics (e.g., accuracy, RMSE, F1 score).
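As a minimal sketch of this first step, assuming a scikit-learn classifier, accuracy as the metric, and a synthetic dataset standing in for the real training set, repeated k-fold cross-validation yields one score per fold and repeat, and those scores together form the expected performance distribution:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Illustrative stand-ins for the real training data and chosen model
X_train, y_train = make_classification(n_samples=800, n_features=5, random_state=42)
model = LogisticRegression()

# Repeated k-fold cross-validation: each fold/repeat contributes one score to the distribution
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(model, X_train, y_train, scoring="accuracy", cv=cv)
print(f"{len(scores)} harness scores collected, first five: {np.round(scores[:5], 3)}")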
- Calculate the Distribution of Expected Performance
  - From the results of the training on the test harness, compute the distribution of the performance metric.
  - Derive key statistics such as the mean, standard deviation, confidence intervals, and min/max values of the distribution.
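A sketch of the summary statistics, assuming the fold scores from the previous step are available in a NumPy array (placeholder values below); the normal-approximation interval is one common choice among several:

import numpy as np
from scipy.stats import norm

# Placeholder harness scores; in practice, use the cross-validation scores collected above
scores = np.array([0.84, 0.86, 0.88, 0.85, 0.87, 0.83, 0.86, 0.89, 0.85, 0.84])

mean, std = scores.mean(), scores.std(ddof=1)
lo, hi = norm.interval(0.95, loc=mean, scale=std)  # interval covering individual harness scores
print(f"mean={mean:.3f} std={std:.3f} min={scores.min():.3f} max={scores.max():.3f}")
print(f"95% normal interval for a single harness score: ({lo:.3f}, {hi:.3f})")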
- Fit the Chosen Model on the Entire Training Set
  - Train the chosen model on the entire training set without sampling or splitting.
  - Ensure all preprocessing steps are consistent with those used during the test harness evaluation phase.
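As a sketch of refitting with consistent preprocessing, assuming standardization was the preprocessing used inside the test harness, a scikit-learn Pipeline keeps those steps tied to the model when it is refit on the full training set (the scaler and classifier here are stand-ins for whatever the harness actually used):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative training data; the pipeline mirrors the preprocessing used in the harness
X_train, y_train = make_classification(n_samples=800, n_features=5, random_state=42)

final_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
final_model.fit(X_train, y_train)  # no sampling or splitting: all training rows are used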
- Evaluate the Chosen Model on the Test Set
  - Apply the fully trained model to the unseen test set.
  - Record the performance metric as a point estimate.
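A minimal sketch of the point estimate, assuming a synthetic train/test split and accuracy as the metric; only a single number is recorded here:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data and split; in practice, the test set is the project's held-out data
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
test_score = accuracy_score(y_test, model.predict(X_test))  # single point estimate
print(f"Test set accuracy (point estimate): {test_score:.3f}")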
- Compare Test Set Performance to the Training Distribution
  - Check whether the test set performance falls within the range or confidence interval of the distribution derived from the test harness.
  - Assess the generalization gap by comparing the test set point estimate to the mean performance of the test harness distribution.
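A sketch of the comparison itself, assuming the harness scores and the test set point estimate from the earlier steps (placeholder values below); it reports both the interval check and the generalization gap:

import numpy as np

# Placeholder inputs; in practice, reuse the cross-validation scores and test score computed earlier
cv_scores = np.array([0.84, 0.86, 0.88, 0.85, 0.87, 0.83, 0.86, 0.89, 0.85, 0.84])
test_score = 0.845

lo, hi = np.percentile(cv_scores, [2.5, 97.5])  # empirical 95% interval of the harness distribution
gap = cv_scores.mean() - test_score             # generalization gap relative to the harness mean

print(f"Harness 95% interval: ({lo:.3f}, {hi:.3f}); test score inside: {lo <= test_score <= hi}")
print(f"Generalization gap (harness mean - test score): {gap:.3f}")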
- Interpret the Results
  - Report whether the test set performance aligns with expectations and note any anomalies observed.
Limitations
- Dependence on Sampling Methodology
  - The reliability of the test harness performance distribution depends on the robustness of the sampling methodology (e.g., k-fold cross-validation or bootstrapping).
- Dataset Size
  - Small training sets may result in highly variable performance distributions, reducing confidence in the comparison.
- Sensitivity to Outliers
  - Outliers in the test set or training distribution can distort results and lead to misleading conclusions about generalization performance.
- Single Test Set
  - The evaluation is limited to a single test set, which may not fully represent the model’s ability to generalize to unseen data.
- Assumptions of Consistency
  - The method assumes that the train and test data come from the same distribution. Any shift in the data distribution may render the comparison invalid.
Code Example
Below is a Python function that estimates expected performance with k-fold cross-validation on the training set, fits the chosen model on the entire training set, builds a bootstrap distribution of test set scores, and reports whether the cross-validation (test harness) score falls within the 95% confidence interval of the bootstrap test set mean.
import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import norm
from sklearn.utils import resample
from sklearn.metrics import make_scorer


def diagnostic_test_analysis(train, test, model, metric, k=10, n_bootstrap=100):
    """
    Perform a diagnostic test to evaluate the generalization gap of a model.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        model: The chosen machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        n_bootstrap (int): Number of bootstrap samples for test set evaluation.

    Returns:
        dict: Dictionary containing the test harness score, test set distribution statistics,
        and whether the test harness score falls within the confidence interval of the
        bootstrap test set mean.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Step 1: Evaluate on the training set (test harness) using k-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    train_performance = np.mean(cv_scores)

    # Step 2: Fit the model on the entire training set
    model.fit(X_train, y_train)

    # Step 3: Bootstrap evaluation on the test set
    bootstrap_scores = []
    for _ in range(n_bootstrap):
        X_bootstrap, y_bootstrap = resample(X_test, y_test, replace=True)
        bootstrap_score = metric(y_bootstrap, model.predict(X_bootstrap))
        bootstrap_scores.append(bootstrap_score)

    # Summarize the bootstrap distribution of test set performance.
    # Note: the interval below is a 95% CI for the bootstrap mean (it uses the
    # standard error, std / sqrt(n_bootstrap)), so it is deliberately narrow.
    test_mean = np.mean(bootstrap_scores)
    test_std = np.std(bootstrap_scores)
    ci_low, ci_high = norm.interval(0.95, loc=test_mean, scale=test_std / np.sqrt(n_bootstrap))

    # Step 4: Compare the test harness score to the test set interval
    within_distribution = ci_low <= train_performance <= ci_high

    return {
        "Train Performance (Test Harness)": train_performance,
        "Test Set Mean": test_mean,
        "Test Set Std Dev": test_std,
        "95% CI": (ci_low, ci_high),
        "Within Distribution": within_distribution,
        "Bootstrap Scores": bootstrap_scores
    }


# Demo Usage
if __name__ == "__main__":
    # Generate a synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the diagnostic test analysis
    results = diagnostic_test_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        model=model,
        metric=metric,
        k=10,
        n_bootstrap=100
    )

    # Print results
    print("Diagnostic Test Analysis Results:")
    for key, value in results.items():
        if key != "Bootstrap Scores":
            print(f" - {key}: {value}")
Example Output
Diagnostic Test Analysis Results:
- Train Performance (Test Harness): 0.86625
- Test Set Mean: 0.8449499999999999
- Test Set Std Dev: 0.027189106274388634
- 95% CI: (np.float64(0.8396210330930365), np.float64(0.8502789669069633))
- Within Distribution: False