Performance Distributions

Procedure

This procedure assumes a chosen model and performance metric, and that k-fold cross-validation is applied in the same way to both the train and test sets so the resulting score samples support an apples-to-apples comparison.

  1. Prepare Samples of Performance Scores

    • What to do: Collect performance scores using the same evaluation procedure for both train and test sets.
      • Apply k-fold cross-validation to the train set to obtain a sample of performance scores.
      • Apply k-fold cross-validation to the test set to obtain a comparable sample of performance scores.
  2. Choose a Statistical Test

    • What to do: Decide on the appropriate statistical test for comparing distributions of the train and test performance scores.
      • Use the Kolmogorov-Smirnov Test to assess whether the two distributions differ significantly.
      • Alternatively, use the Anderson-Darling Test, which tends to be more sensitive when the samples are small (a usage sketch of both tests follows this procedure).
  3. Perform the Statistical Test

    • What to do: Execute the chosen test to compare the distributions of performance scores.
      • Input the train and test performance score samples into the selected statistical test.
      • Record the test statistic and p-value from the result.
  4. Interpret the Results

    • What to do: Assess whether the train and test score distributions are statistically similar or different.
      • If the p-value is at or below the chosen significance level (e.g., 0.05), conclude that the score distributions differ significantly.
      • If the p-value is above the significance level, conclude that there is no statistically significant difference between the distributions (which is not proof that they are identical).
  5. Report the Findings

    • What to do: Summarize the results of the statistical test for stakeholders.
      • Clearly state whether the train and test distributions are statistically similar or different.
      • Discuss potential implications for model generalization or performance issues, if applicable.
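
The statistical comparison in steps 2 through 4 can be sketched in a few lines. The snippet below is a minimal sketch, assuming two hypothetical lists of per-fold scores (train_scores and test_scores are made-up values, not output from the procedure above); it applies both scipy.stats.ks_2samp and scipy.stats.anderson_ksamp and reads the results against a 0.05 significance level.

from scipy.stats import ks_2samp, anderson_ksamp

# Hypothetical per-fold scores from k-fold cross-validation (assumed values for illustration)
train_scores = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.90, 0.91, 0.89]
test_scores = [0.88, 0.90, 0.87, 0.91, 0.86, 0.89, 0.90, 0.85, 0.88, 0.87]
alpha = 0.05

# Kolmogorov-Smirnov two-sample test: compares the empirical CDFs of the two samples
ks_stat, ks_p = ks_2samp(train_scores, test_scores)
print(f"KS: statistic={ks_stat:.3f}, p-value={ks_p:.3f}, "
      f"{'significant difference' if ks_p <= alpha else 'no significant difference'}")

# Anderson-Darling k-sample test: often more sensitive for small samples
# Note: scipy caps the reported significance level between 0.001 and 0.25
ad_result = anderson_ksamp([train_scores, test_scores])
print(f"Anderson-Darling: statistic={ad_result.statistic:.3f}, "
      f"approx. significance level={ad_result.significance_level:.3f}")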

Code Example

Below is a Python function that compares the distributions of cross-validation performance scores on the train and test sets, using the Kolmogorov-Smirnov test to assess whether the two samples come from the same distribution.

from sklearn.model_selection import cross_val_score
from scipy.stats import ks_2samp
from sklearn.metrics import make_scorer

def distribution_comparison_diagnostic(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Compare the distributions of cross-validation performance scores on the train and test sets.

    Parameters:
        X_train, y_train: Train set features and target.
        X_test, y_test: Test set features and target.
        model: Machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        alpha (float): Significance level for statistical tests (default=0.05).

    Returns:
        dict: Contains the test statistic, p-value, and interpretation of the test.
    """
    # Cross-validation scores for train and test sets
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Perform Kolmogorov-Smirnov Test
    ks_stat, ks_p_value = ks_2samp(cv_train_scores, cv_test_scores)

    # Interpret results
    ks_significant = "significant difference" if ks_p_value <= alpha else "no significant difference"

    return {
        "Kolmogorov-Smirnov Test": {
            "Statistic": ks_stat,
            "P-Value": ks_p_value,
            "Significance": ks_significant
        }
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the distribution comparison diagnostic test
    results = distribution_comparison_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        k=10
    )

    # Print Results
    print("Distribution Comparison Diagnostic Test Results:")
    for test_name, result in results.items():
        print(f"\n{test_name}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Distribution Comparison Diagnostic Test Results:

Kolmogorov-Smirnov Test:
  - Statistic: 0.3
  - P-Value: 0.7869297884777761
  - Significance: no significant difference
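
For the reporting step, it can help to pair the test conclusion with simple descriptive statistics so stakeholders see the size of any train/test gap, not just its statistical significance. The sketch below uses a hypothetical helper, summarize_score_samples, which is not part of the code above; it assumes the two arrays of per-fold scores are available (the distribution_comparison_diagnostic function above would need to return cv_train_scores and cv_test_scores for this to be called directly on its output).

import numpy as np
from scipy.stats import ks_2samp

def summarize_score_samples(train_scores, test_scores, alpha=0.05):
    """Hypothetical reporting helper: descriptive statistics plus the KS conclusion."""
    ks_stat, ks_p = ks_2samp(train_scores, test_scores)
    gap = np.mean(train_scores) - np.mean(test_scores)
    print(f"Train CV scores: mean={np.mean(train_scores):.3f}, std={np.std(train_scores):.3f}")
    print(f"Test CV scores:  mean={np.mean(test_scores):.3f}, std={np.std(test_scores):.3f}")
    print(f"Mean gap (train - test): {gap:.3f}")
    if ks_p <= alpha:
        print("Conclusion: the score distributions differ significantly; "
              "the model may not generalize as well as the train-set scores suggest.")
    else:
        print("Conclusion: no statistically significant difference detected "
              "between the train and test score distributions.")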