Test Harness Performance Distribution

Is the difference in performance caused by the variance of the test harness?

Procedure

  1. Train the Chosen Model on the Test Harness

    • Train the chosen model multiple times on the training set using a sampling technique such as k-fold cross-validation, bootstrapping, or repeated train/test splits.
    • Record the model’s performance for each training iteration to build a distribution of expected performance metrics (e.g., accuracy, RMSE, F1 score).
  2. Calculate the Distribution of Expected Performance

    • From the test harness results, compute the distribution of the performance metric.
    • Derive key statistics such as the mean, standard deviation, confidence intervals, and min/max values of the distribution.
  3. Fit the Chosen Model on the Entire Training Set

    • Train the chosen model on the entire training set without sampling or splitting.
    • Ensure all preprocessing steps are consistent with those used during the test harness evaluation phase.
  4. Evaluate the Chosen Model on the Test Set

    • Apply the fully trained model to the unseen test set.
    • Record the performance metric as a point estimate.
  5. Compare Test Set Performance to the Training Distribution

    • Check if the test set performance falls within the range or confidence interval of the distribution derived from the test harness.
    • Assess the generalization gap by comparing the test set point estimate to the mean performance of the test harness distribution.
  6. Interpret the Results

    • Report whether the test set performance aligns with expectations, and note any anomalies observed (a sketch of this workflow follows the list).
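
To make the procedure concrete, the sketch below walks through steps 1 through 5 with scikit-learn: it builds a distribution of k-fold cross-validation scores on the training set, summarizes that distribution with a simple normal-approximation interval, fits the model on the full training set, scores the held-out test set once, and checks whether the point estimate falls inside the interval. The synthetic dataset, logistic regression model, and accuracy metric are illustrative placeholders, not prescribed by this section.

import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Placeholder data, model, and metric for illustration only.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = LogisticRegression()

# Step 1: build a distribution of test harness scores with k-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")

# Step 2: summarize the distribution (mean, spread, normal-approximation interval).
cv_mean, cv_std = cv_scores.mean(), cv_scores.std()
ci_low, ci_high = norm.interval(0.95, loc=cv_mean, scale=cv_std)

# Steps 3-4: fit on the entire training set and score the held-out test set once.
model.fit(X_train, y_train)
test_score = accuracy_score(y_test, model.predict(X_test))

# Step 5: compare the test set point estimate to the cross-validation distribution.
print(f"CV mean={cv_mean:.3f}, 95% interval=({ci_low:.3f}, {ci_high:.3f})")
print(f"Test score={test_score:.3f}, gap={test_score - cv_mean:+.3f}")
print(f"Within interval: {ci_low <= test_score <= ci_high}")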

Limitations

  • Dependence on Sampling Methodology

    • The reliability of the test harness performance distribution depends on the robustness of the sampling methodology (e.g., k-fold cross-validation or bootstrapping).
  • Dataset Size

    • Small training sets may result in highly variable performance distributions, reducing confidence in the comparison.
  • Sensitivity to Outliers

    • Outliers in the test set or training distribution can distort results and lead to misleading conclusions about generalization performance.
  • Single Test Set

    • The evaluation is limited to a single test set, which may not fully represent the model’s ability to generalize to unseen data.
  • Assumptions of Consistency

    • The method assumes that the train and test data come from the same distribution; any shift in the data distribution may render the comparison invalid (a simple per-feature check is sketched below).
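
The same-distribution assumption can be probed before trusting the comparison. Below is a minimal sketch that applies a per-feature two-sample Kolmogorov–Smirnov test from scipy to numeric features; the check_feature_shift helper and the choice of test are illustrative assumptions, not part of this section's method.

from scipy.stats import ks_2samp

def check_feature_shift(X_train, X_test, alpha=0.05):
    """Flag numeric features whose train/test distributions appear to differ."""
    flagged = []
    for j in range(X_train.shape[1]):
        # Two-sample KS test compares the empirical distributions of feature j.
        res = ks_2samp(X_train[:, j], X_test[:, j])
        if res.pvalue < alpha:
            flagged.append((j, res.statistic, res.pvalue))
    return flagged

# Example usage with NumPy arrays X_train and X_test:
# shifted = check_feature_shift(X_train, X_test)
# print("Features flagged for shift:", shifted or "none")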

Code Example

Below is a Python function that computes a model’s mean k-fold cross-validation score on the training set (the test harness), builds a bootstrap distribution of its performance on the held-out test set, and reports whether the cross-validation estimate falls within the 95% confidence interval of that bootstrap distribution’s mean.

import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import norm
from sklearn.utils import resample
from sklearn.metrics import make_scorer

def diagnostic_test_analysis(train, test, model, metric, k=10, n_bootstrap=100):
    """
    Perform a diagnostic test to evaluate the generalization gap of a model.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        model: The chosen machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        n_bootstrap (int): Number of bootstrap samples for test set evaluation.

    Returns:
        dict: Dictionary containing the cross-validation estimate, bootstrap test set statistics,
              and whether that estimate falls within the 95% confidence interval of the bootstrap mean.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Step 1: Evaluate on training set (test harness) using k-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    train_performance = np.mean(cv_scores)

    # Step 2: Fit model on entire training set
    model.fit(X_train, y_train)

    # Step 3: Bootstrap evaluation on test set
    bootstrap_scores = []
    for _ in range(n_bootstrap):
        X_bootstrap, y_bootstrap = resample(X_test, y_test, replace=True)
        bootstrap_score = metric(y_bootstrap, model.predict(X_bootstrap))
        bootstrap_scores.append(bootstrap_score)

    # Summarize the bootstrap scores and compute a 95% CI for their mean (standard error = std / sqrt(n_bootstrap))
    test_mean = np.mean(bootstrap_scores)
    test_std = np.std(bootstrap_scores)
    ci_low, ci_high = norm.interval(0.95, loc=test_mean, scale=test_std / np.sqrt(n_bootstrap))

    # Step 4: Check whether the cross-validation estimate falls within the test set CI
    within_distribution = ci_low <= train_performance <= ci_high

    return {
        "Train Performance (Test Harness)": train_performance,
        "Test Set Mean": test_mean,
        "Test Set Std Dev": test_std,
        "95% CI": (ci_low, ci_high),
        "Within Distribution": within_distribution,
        "Bootstrap Scores": bootstrap_scores
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the diagnostic test analysis
    results = diagnostic_test_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        model=model,
        metric=metric,
        k=10,
        n_bootstrap=100
    )

    # Print Results
    print("Diagnostic Test Analysis Results:")
    for key, value in results.items():
        if key != "Bootstrap Scores":
            print(f"  - {key}: {value}")

Example Output

Diagnostic Test Analysis Results:
  - Train Performance (Test Harness): 0.86625
  - Test Set Mean: 0.8449499999999999
  - Test Set Std Dev: 0.027189106274388634
  - 95% CI: (np.float64(0.8396210330930365), np.float64(0.8502789669069633))
  - Within Distribution: False
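
In this example run, the cross-validation estimate (about 0.866) lies above the test set interval of roughly 0.840 to 0.850, so the result is flagged as outside the distribution; following step 6, this would be reported as the test harness appearing somewhat optimistic relative to held-out performance rather than the gap being attributable to harness variance alone.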