Test Harness Performance Distribution

Is the difference in performance caused by the variance of the test harness?

Procedure

  1. Train the Chosen Model on the Test Harness

    • Train the chosen model multiple times on the training set using a sampling technique such as k-fold cross-validation, bootstrapping, or repeated train/test splits.
    • Record the model’s performance for each training iteration to build a distribution of expected performance metrics (e.g., accuracy, RMSE, F1 score).
  2. Calculate the Distribution of Expected Performance

    • From the test harness results, compute the distribution of the performance metric.
    • Derive key statistics such as the mean, standard deviation, confidence intervals, and min/max values of the distribution.
  3. Fit the Chosen Model on the Entire Training Set

    • Train the chosen model on the entire training set without sampling or splitting.
    • Ensure all preprocessing steps are consistent with those used during the test harness evaluation phase.
  4. Evaluate the Chosen Model on the Test Set

    • Apply the fully trained model to the unseen test set.
    • Record the performance metric as a point estimate.
  5. Compare Test Set Performance to the Training Distribution

    • Check if the test set performance falls within the range or confidence interval of the distribution derived from the test harness.
    • Assess the generalization gap by comparing the test set point estimate to the mean performance of the test harness distribution.
  6. Interpret the Results

    • Report whether the test set performance aligns with expectations, and note any anomalies observed (a sketch of this workflow follows the list).
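
To make the procedure concrete, the sketch below walks through steps 1 through 5 with scikit-learn: it builds a distribution of k-fold cross-validation scores on the training set, summarizes that distribution with a simple normal-approximation interval, fits the model on the full training set, scores the held-out test set once, and checks whether the point estimate falls inside the interval. The synthetic dataset, logistic regression model, and accuracy metric are illustrative placeholders, not prescribed by this section.

import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Placeholder data, model, and metric for illustration only.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = LogisticRegression()

# Step 1: build a distribution of test harness scores with k-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")

# Step 2: summarize the distribution (mean, spread, normal-approximation interval).
cv_mean, cv_std = cv_scores.mean(), cv_scores.std()
ci_low, ci_high = norm.interval(0.95, loc=cv_mean, scale=cv_std)

# Steps 3-4: fit on the entire training set and score the held-out test set once.
model.fit(X_train, y_train)
test_score = accuracy_score(y_test, model.predict(X_test))

# Step 5: compare the test set point estimate to the cross-validation distribution.
print(f"CV mean={cv_mean:.3f}, 95% interval=({ci_low:.3f}, {ci_high:.3f})")
print(f"Test score={test_score:.3f}, gap={test_score - cv_mean:+.3f}")
print(f"Within interval: {ci_low <= test_score <= ci_high}")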

Limitations

  • Dependence on Sampling Methodology

    • The reliability of the test harness performance distribution depends on the robustness of the sampling methodology (e.g., k-fold cross-validation or bootstrapping).
  • Dataset Size

    • Small training sets may result in highly variable performance distributions, reducing confidence in the comparison.
  • Sensitivity to Outliers

    • Outliers in the test set or training distribution can distort results and lead to misleading conclusions about generalization performance.
  • Single Test Set

    • The evaluation is limited to a single test set, which may not fully represent the model’s ability to generalize to unseen data.
  • Assumptions of Consistency

    • The method assumes that the train and test data come from the same distribution; any shift in the data distribution may render the comparison invalid (a simple per-feature check is sketched below).
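
The same-distribution assumption can be probed before trusting the comparison. Below is a minimal sketch that applies a per-feature two-sample Kolmogorov–Smirnov test from scipy to numeric features; the check_feature_shift helper and the choice of test are illustrative assumptions, not part of this section's method.

from scipy.stats import ks_2samp

def check_feature_shift(X_train, X_test, alpha=0.05):
    """Flag numeric features whose train/test distributions appear to differ."""
    flagged = []
    for j in range(X_train.shape[1]):
        # Two-sample KS test compares the empirical distributions of feature j.
        res = ks_2samp(X_train[:, j], X_test[:, j])
        if res.pvalue < alpha:
            flagged.append((j, res.statistic, res.pvalue))
    return flagged

# Example usage with NumPy arrays X_train and X_test:
# shifted = check_feature_shift(X_train, X_test)
# print("Features flagged for shift:", shifted or "none")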

Code Example

Below is a Python function that computes a model’s mean k-fold cross-validation score on the training set (the test harness), builds a bootstrap distribution of its performance on the held-out test set, and reports whether the cross-validation estimate falls within the 95% confidence interval of that bootstrap distribution’s mean.

import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import norm
from sklearn.utils import resample
from sklearn.metrics import make_scorer

def diagnostic_test_analysis(train, test, model, metric, k=10, n_bootstrap=100):
    """
    Perform a diagnostic test to evaluate the generalization gap of a model.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        model: The chosen machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        n_bootstrap (int): Number of bootstrap samples for test set evaluation.

    Returns:
        dict: Dictionary containing the cross-validation estimate, bootstrap test set statistics,
              and whether that estimate falls within the 95% confidence interval of the bootstrap mean.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Step 1: Evaluate on training set (test harness) using k-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    train_performance = np.mean(cv_scores)

    # Step 2: Fit model on entire training set
    model.fit(X_train, y_train)

    # Step 3: Bootstrap evaluation on test set
    bootstrap_scores = []
    for _ in range(n_bootstrap):
        X_bootstrap, y_bootstrap = resample(X_test, y_test, replace=True)
        bootstrap_score = metric(y_bootstrap, model.predict(X_bootstrap))
        bootstrap_scores.append(bootstrap_score)

    # Summarize the bootstrap scores and compute a 95% CI for their mean (standard error = std / sqrt(n_bootstrap))
    test_mean = np.mean(bootstrap_scores)
    test_std = np.std(bootstrap_scores)
    ci_low, ci_high = norm.interval(0.95, loc=test_mean, scale=test_std / np.sqrt(n_bootstrap))

    # Step 4: Check whether the cross-validation estimate falls within the test set CI
    within_distribution = ci_low <= train_performance <= ci_high

    return {
        "Train Performance (Test Harness)": train_performance,
        "Test Set Mean": test_mean,
        "Test Set Std Dev": test_std,
        "95% CI": (ci_low, ci_high),
        "Within Distribution": within_distribution,
        "Bootstrap Scores": bootstrap_scores
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the diagnostic test analysis
    results = diagnostic_test_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        model=model,
        metric=metric,
        k=10,
        n_bootstrap=100
    )

    # Print Results
    print("Diagnostic Test Analysis Results:")
    for key, value in results.items():
        if key != "Bootstrap Scores":
            print(f"  - {key}: {value}")

Example Output

Diagnostic Test Analysis Results:
  - Train Performance (Test Harness): 0.86625
  - Test Set Mean: 0.8449499999999999
  - Test Set Std Dev: 0.027189106274388634
  - 95% CI: (np.float64(0.8396210330930365), np.float64(0.8502789669069633))
  - Within Distribution: False
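
In this example run, the cross-validation estimate (about 0.866) lies above the test set interval of roughly 0.840 to 0.850, so the result is flagged as outside the distribution; following step 6, this would be reported as the test harness appearing somewhat optimistic relative to held-out performance rather than the gap being attributable to harness variance alone.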