Performance Central Tendencies

Procedure

This procedure assumes a chosen model and performance metric, and that the same k-fold cross-validation procedure is applied to both datasets so the comparison is apples-to-apples.

  1. Prepare Performance Score Samples

    • What to do: Apply the same k-fold cross-validation evaluation procedure to both the train and test datasets.
      • Collect a sample of performance scores from the train set.
      • Collect a sample of performance scores from the test set.
  2. Choose a Statistical Test

    • What to do: Decide whether to use a parametric or non-parametric test based on the distribution of the performance scores (a minimal sketch of this selection logic follows the list):
      • Use a t-Test if the scores are approximately normally distributed and have similar variances.
      • Use a Mann-Whitney U Test if the scores are not normally distributed or have unequal variances.
  3. Perform the Statistical Test

    • What to do: Apply the chosen statistical test to compare the central tendencies of the performance score samples:
      • For a t-Test, calculate the p-value to test for equality of means.
      • For a Mann-Whitney U Test, calculate the p-value to test whether one sample tends to produce higher scores than the other.
  4. Interpret the Results

    • What to do: Assess whether there is statistical evidence that the performance scores on the train and test sets differ:
      • If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis and conclude that the train and test performance scores are significantly different.
      • If the p-value is greater than the significance level, fail to reject the null hypothesis, indicating no significant difference between the train and test performance scores.
  5. Report the Findings

    • What to do: Summarize the results of the statistical test for stakeholders:
      • Clearly state whether the performance scores on the train and test sets are statistically similar or different.
      • Include relevant metrics (e.g., p-value, mean/median scores) and implications for model generalizability.
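
Before the full function, here is a minimal sketch of the selection logic in steps 2 and 3, assuming two NumPy arrays of cross-validation scores are already available. The helper name choose_and_run_test and the randomly generated score samples are illustrative only; the sketch adds a Levene variance check (so an unequal-variance case falls back to Welch's t-test), whereas the full function below keeps to a simpler normality-only rule.

import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

def choose_and_run_test(scores_a, scores_b, alpha=0.05):
    """Sketch of steps 2-3: select a test for two samples of CV scores and run it."""
    # Step 2: check normality of each sample and similarity of variances
    normal_a = shapiro(scores_a).pvalue > alpha
    normal_b = shapiro(scores_b).pvalue > alpha
    similar_var = levene(scores_a, scores_b).pvalue > alpha

    # Step 3: run the chosen test and return its p-value
    if normal_a and normal_b:
        # Parametric route: Student's t-test when variances look similar,
        # otherwise Welch's t-test (equal_var=False)
        stat, p = ttest_ind(scores_a, scores_b, equal_var=similar_var)
        name = "t-Test" if similar_var else "Welch's t-Test"
    else:
        # Non-parametric route when either sample departs from normality
        stat, p = mannwhitneyu(scores_a, scores_b, alternative='two-sided')
        name = "Mann-Whitney U Test"
    return name, stat, p

# Illustrative score samples standing in for train/test CV results
rng = np.random.default_rng(42)
train_scores = rng.normal(loc=0.95, scale=0.01, size=10)
test_scores = rng.normal(loc=0.93, scale=0.02, size=10)
print(choose_and_run_test(train_scores, test_scores))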

Code Example

Below is a Python function that compares the performance of a model on train and test datasets using k-fold cross-validation and applies a statistical test to evaluate whether the performances differ.

import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from sklearn.metrics import make_scorer

def performance_comparison_test(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Perform a diagnostic test to compare train and test set performances using k-fold CV.

    Parameters:
        X_train, y_train: Training set features and labels.
        X_test, y_test: Test set features and labels.
        model: Machine learning model to evaluate.
        metric: Metric function with the signature metric(y_true, y_pred); it is wrapped internally with make_scorer.
        k (int): Number of folds for cross-validation.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Contains statistical test results, p-value, and interpretation.
    """
    # Cross-validation scores for the train set
    train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    # Cross-validation scores for the test set
    test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Check normality of distributions
    _, p_train_normal = shapiro(train_scores)
    _, p_test_normal = shapiro(test_scores)

    # Determine the appropriate statistical test based on the normality checks
    if p_train_normal > alpha and p_test_normal > alpha:
        # Both score samples look approximately normal: use a t-test
        stat, p_value = ttest_ind(train_scores, test_scores)
        test_type = "t-Test"
    else:
        # At least one sample departs from normality: use the Mann-Whitney U test
        stat, p_value = mannwhitneyu(train_scores, test_scores, alternative='two-sided')
        test_type = "Mann-Whitney U Test"

    # Interpret results
    significance = "significant" if p_value < alpha else "not significant"
    interpretation = (
        "Train and test performances are similar"
        if significance == "not significant"
        else "Train and test performances differ"
    )

    return {
        "Statistical Test": test_type,
        "Statistic": stat,
        "P-Value": p_value,
        "Significance": significance,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Create synthetic data
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the performance comparison test
    results = performance_comparison_test(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        k=10
    )

    # Print results
    print("Performance Comparison Test Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Performance Comparison Test Results:
  - Statistical Test: t-Test
  - Statistic: 1.5852581740085454
  - P-Value: 0.13031879056799284
  - Significance: not significant
  - Interpretation: Train and test performances are similar