Test Set Performance Bootstrap

Is the difference in the performance scores caused by the specific details of the test set?

Procedure

  1. Evaluate Performance on the Train Set (Test Harness)

    • Assess the performance of the model using cross-validation or other train-set evaluation techniques.
    • Record key metrics (e.g., accuracy, precision, recall, F1-score, or RMSE) as the train set performance.
  2. Evaluate Performance on the Hold-Out Test Set

    • Measure the performance of the final chosen model on the test set.
    • Compute the same metrics as used for the train set evaluation.
  3. Bootstrap Test Set Performance

    • Generate 30+ bootstrap samples from the test set to evaluate the model multiple times.
    • Record the performance metrics for each bootstrap iteration to create a distribution of test set performance.
  4. Quantify Performance Gap

    • Calculate the absolute difference between the average performance on the train set (test harness) and the average performance from the test set bootstrap distribution.
    • Compare the train set metric to the bootstrap test set distribution (e.g., check whether it falls within the confidence interval or outside the min/max bounds); see the sketch after this list.
  5. Interpret Results

    • Identify whether the train set (test harness) performance lies within the bounds of the test set performance distribution.
    • Note any significant deviations, such as overperformance or underperformance on the train set relative to the test set.
  6. Summarize Insights and Recommendations

    • Report the quantified performance gap and its implications for model reliability.
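
A minimal sketch of steps 3 and 4 is shown below, assuming accuracy as the metric: it resamples the (label, prediction) pairs of a held-out test set with replacement and checks whether a test harness score falls inside a 95% percentile interval of the resulting score distribution. The names bootstrap_gap_check, y_pred, and train_score are illustrative placeholders, not part of any particular library.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

def bootstrap_gap_check(y_test, y_pred, train_score, n_bootstrap=100, seed=42):
    # Step 3: resample (label, prediction) pairs with replacement to build a
    # distribution of plausible test set scores.
    scores = [
        accuracy_score(*resample(y_test, y_pred, replace=True, random_state=seed + i))
        for i in range(n_bootstrap)
    ]
    # Step 4: absolute performance gap and a 95% percentile interval of the distribution.
    low, high = np.percentile(scores, [2.5, 97.5])
    gap = abs(train_score - np.mean(scores))
    return gap, (low, high), low <= train_score <= high

# Illustrative usage with synthetic labels and roughly 90%-accurate predictions
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.9, y_test, 1 - y_test)
print(bootstrap_gap_check(y_test, y_pred, train_score=0.90))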

Limitations

  • Bootstrap Sampling Assumptions

    • Results rely on the representativeness of bootstrap samples; unbalanced or small test sets can skew the bootstrap distribution (a stratified resampling mitigation is sketched after this list).
  • Sample Size Dependence

    • Small test sets or insufficient bootstrap samples may produce unreliable estimates of performance distribution.
  • Overfitting Risks

    • If the model has been tuned heavily against the test harness, the performance gap may appear misleadingly small because the model has unintentionally overfit to characteristics it shares with the test set.
  • Metric Variability

    • The choice of evaluation metric may influence the perceived performance gap. Use metrics aligned with the business or model objectives.
  • Context Sensitivity

    • This diagnostic test is specific to the given dataset and task. Results may not generalize to other datasets or models without re-evaluation.
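
For the imbalanced-test-set case noted under Bootstrap Sampling Assumptions, one mitigation is to stratify each bootstrap sample by class label so that every resample preserves the original class proportions. The sketch below is an illustration under assumptions: it expects an already-fitted classifier, uses accuracy as the metric, and the helper name stratified_bootstrap_scores is not part of the procedure above.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

def stratified_bootstrap_scores(model, X_test, y_test, n_bootstrap=100, seed=42):
    # Each bootstrap sample preserves the class proportions of y_test, which
    # stabilizes the score distribution when the test set is imbalanced.
    scores = []
    for i in range(n_bootstrap):
        X_b, y_b = resample(X_test, y_test, replace=True,
                            stratify=y_test, random_state=seed + i)
        scores.append(accuracy_score(y_b, model.predict(X_b)))
    return np.array(scores)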

Code Example

This function evaluates a machine learning model’s performance using k-fold cross-validation on the training set and bootstrap sampling on the test set to establish and compare performance distributions.

import numpy as np
from scipy.stats import norm
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

def diagnostic_test_analysis(train, test, model, metric, k=10, n_bootstrap=100):
    """
    Perform a diagnostic test to evaluate the generalization gap of a model.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        model: The chosen machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        n_bootstrap (int): Number of bootstrap samples for test set evaluation.

    Returns:
        dict: Dictionary containing the performance gap, bootstrap test set distribution statistics,
              and whether the test harness score falls within the 95% confidence interval of the
              bootstrap test set performance.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Step 1: Evaluate on training set (test harness) using k-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    train_performance = np.mean(cv_scores)

    # Step 2: Fit model on entire training set
    model.fit(X_train, y_train)

    # Step 3: Bootstrap evaluation on test set
    bootstrap_scores = []
    for _ in range(n_bootstrap):
        X_bootstrap, y_bootstrap = resample(X_test, y_test, replace=True)
        bootstrap_score = metric(y_bootstrap, model.predict(X_bootstrap))
        bootstrap_scores.append(bootstrap_score)

    # Summarize the bootstrap distribution: mean, standard deviation, and a 95%
    # normal-approximation confidence interval for the mean bootstrap score
    test_mean = np.mean(bootstrap_scores)
    test_std = np.std(bootstrap_scores)
    ci_low, ci_high = norm.interval(0.95, loc=test_mean, scale=test_std / np.sqrt(n_bootstrap))

    # Step 4: Quantify the performance gap and check whether the train (test harness)
    # score falls inside the bootstrap confidence interval
    performance_gap = abs(train_performance - test_mean)
    within_distribution = ci_low <= train_performance <= ci_high

    return {
        "Train Performance (Test Harness)": train_performance,
        "Test Set Mean": test_mean,
        "Test Set Std Dev": test_std,
        "95% CI": (ci_low, ci_high),
        "Within Distribution": within_distribution,
        "Bootstrap Scores": bootstrap_scores
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the diagnostic test analysis
    results = diagnostic_test_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        model=model,
        metric=metric,
        k=10,
        n_bootstrap=100
    )

    # Print Results
    print("Diagnostic Test Analysis Results:")
    for key, value in results.items():
        if key != "Bootstrap Scores":
            print(f"  - {key}: {value}")

Example Output

Diagnostic Test Analysis Results:
  - Train Performance (Test Harness): 0.8525
  - Test Set Mean: 0.8504
  - Test Set Std Dev: 0.0123
  - 95% CI: (0.8472, 0.8536)
  - Performance Gap: 0.0021
  - Within Distribution: True
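
Note that the 95% CI above is a normal-approximation interval for the mean of the bootstrap scores (the standard deviation is scaled by the square root of n_bootstrap). If the goal is to check whether the harness score lies within the bounds of the bootstrap distribution itself, a percentile interval over the raw scores, e.g. np.percentile(bootstrap_scores, [2.5, 97.5]), is a common and generally wider alternative.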