Model Performance Distribution

Is the difference in performance caused by the variance of the model?

Procedure

  1. Train and Validate Model Versions

    • Train and evaluate multiple versions of the chosen model on the training set (e.g. different random seeds).
    • Record validation performance metrics for all versions.
  2. Establish Performance Distribution on Train Set

    • Compile the validation metrics into a distribution.
    • Calculate summary statistics such as mean, standard deviation, and confidence intervals for the distribution.
  3. Evaluate Model on the Test Set

    • Use the unseen test set to evaluate the performance of a model fit on the entire train set.
    • A variation may involve fitting multiple versions of the model on the train set and evaluating them on the test set to develop a distribution of performance scores.
    • Record the test metrics for comparison.
  4. Analyze the Chosen Model’s Performance

    • Check whether the model's test set performance falls within the distribution of validation performance established on the train set (a minimal sketch of this check follows this list).
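
A minimal sketch of steps 2–4 is shown below. It assumes a set of validation scores has already been collected (the values here are hypothetical placeholders), summarizes them with a mean, standard deviation, and 95% confidence interval, and checks whether a single test set score falls inside the central range of that distribution. The full diagnostic function in the Code Example section automates the collection of these scores.

import numpy as np
from scipy import stats

# Hypothetical validation scores collected over repeated trials with different seeds (step 2)
val_scores = np.array([0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.92, 0.93])

# Summary statistics of the validation performance distribution
mean = val_scores.mean()
std = val_scores.std(ddof=1)
ci_low, ci_high = stats.t.interval(0.95, len(val_scores) - 1, loc=mean, scale=stats.sem(val_scores))
print(f"mean={mean:.3f}, std={std:.3f}, 95% CI for the mean=({ci_low:.3f}, {ci_high:.3f})")

# Hypothetical score of the final model on the unseen test set (step 3)
test_score = 0.89

# Step 4: does the test score fall within the central 95% of the validation distribution?
lo, hi = np.percentile(val_scores, [2.5, 97.5])
status = "within" if lo <= test_score <= hi else "outside"
print(f"Test score {test_score:.3f} falls {status} the validation range ({lo:.3f}, {hi:.3f})")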

Limitations

  • Model Diversity

    • Limited diversity in the chosen model versions can lead to an incomplete representation of the performance range.
  • Test Set Size

    • Small test sets may produce unreliable estimates due to high variance.
  • Data Leakage

    • Any overlap or leakage between the validation and test sets can invalidate the results and lead to misleading conclusions.
  • Overemphasis on Statistical Metrics

    • Performance metrics alone may not capture critical aspects of model suitability, such as interpretability or business relevance.
  • Non-IID Data

    • If the data in the test set differs significantly from the training and validation sets (e.g., due to temporal shifts), the performance gap may not reflect true model generalization. A quick check for such shifts is sketched after this list.
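
The Non-IID caveat can be partially checked before running the diagnostic. The sketch below is a minimal, hedged example that reuses ks_2samp to compare each feature's distribution between the train and test sets; feature_shift_check is a hypothetical helper, and X_train / X_test are assumed to be NumPy arrays as in the demo further below.

import numpy as np
from scipy.stats import ks_2samp

def feature_shift_check(X_train, X_test, alpha=0.05):
    """Flag features whose train and test distributions differ (two-sample KS test)."""
    shifted = []
    for j in range(X_train.shape[1]):
        stat, p_value = ks_2samp(X_train[:, j], X_test[:, j])
        if p_value <= alpha:
            shifted.append((j, stat, p_value))
    return shifted

# Example usage: few or no features should be flagged for an IID dataset
# shifted = feature_shift_check(X_train, X_test)
# print(f"{len(shifted)} feature(s) show a significant train/test shift")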

Code Example

Below is a Python function that performs the diagnostic: it builds performance distributions for a chosen model across multiple trials with different random seeds and evaluates whether the test set performance distribution is consistent with the train set performance distribution.

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import ks_2samp
from sklearn.metrics import make_scorer

def performance_distribution_test(train, test, model, metric, k=10, trials=30):
    """
    Evaluate the consistency of performance distributions between train and test sets for a chosen model.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        model: The chosen model instance compatible with scikit-learn.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        trials (int): Number of times to fit and evaluate the model using different random seeds.

    Returns:
        dict: KS statistic, p-value, and an interpretation of whether the train and test score distributions differ.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Storage for performance scores
    train_scores = []
    test_scores = []

    # Perform multiple trials with different seeds
    for seed in range(trials):

        # vary the model's seed for each trial (assumes the model exposes a random_state parameter)
        model.set_params(random_state=seed)

        # k-fold cross-validation with a fixed split, so trial-to-trial variation comes from the model seed, not the data split
        cv = KFold(n_splits=k, shuffle=True, random_state=42)
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=make_scorer(metric))
        train_scores.append(np.mean(cv_scores))

        # Fit on train set and evaluate on test set
        model.fit(X_train, y_train)
        test_scores.append(metric(y_test, model.predict(X_test)))

    # Perform statistical comparison of distributions
    ks_stat, p_value = ks_2samp(train_scores, test_scores)
    interpretation = (
        "Distributions are similar (p > 0.05)."
        if p_value > 0.05 else
        "Distributions are different (p <= 0.05)."
    )

    return {
        "KS Statistic": ks_stat,
        "P-value": p_value,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Generate synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define chosen model and metric
    chosen_model = RandomForestClassifier()
    chosen_metric = accuracy_score

    # Run the diagnostic test
    results = performance_distribution_test(
        train=(X_train, y_train),
        test=(X_test, y_test),
        model=chosen_model,
        metric=chosen_metric,
        k=5,
        trials=20
    )

    # Print Results
    print("Performance Distribution Test Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Performance Distribution Test Results:
  - KS Statistic: 0.9
  - P-value: 1.1316933501002751e-08
  - Interpretation: Distributions are different (p <= 0.05).