Performance Variance
Procedure
Assumes a chosen model and performance metric, and that the same k-fold cross-validation procedure is applied to both the train and test sets for an apples-to-apples performance comparison.
Evaluate the Chosen Model on Train and Test Sets
- What to do: Use the chosen evaluation procedure (e.g., k-fold cross-validation) to gather performance scores, as shown in the sketch below:
- Apply k-fold cross-validation to the train set and collect a sample of performance scores.
- Apply the same k-fold cross-validation procedure to the test set and collect a separate sample of performance scores.
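As a sketch of this step, the snippet below collects the two score samples with scikit-learn's cross_val_score; the synthetic dataset, 80/20 split, random forest model, and accuracy metric are illustrative placeholders rather than prescribed choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Illustrative data, split, and model (stand-ins for the chosen setup)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = RandomForestClassifier(random_state=42)

# The same k-fold procedure applied independently to each set
cv = KFold(n_splits=10, shuffle=True, random_state=42)
train_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
test_scores = cross_val_score(model, X_test, y_test, cv=cv, scoring="accuracy")
print("Train score variance:", train_scores.var(ddof=1))
print("Test score variance:", test_scores.var(ddof=1))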
Choose the Variance Testing Method
- What to do: Select a statistical test to compare the variances of the two samples (see the sketch below):
- Use an F-test if the performance score samples are normally distributed.
- Use Levene’s test or the Brown-Forsythe test if the samples may not follow normal distributions or have unequal sizes.
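A minimal sketch of that selection logic, using synthetic stand-in score samples: in SciPy, levene with center='mean' is the classical Levene test, while center='median' gives the Brown-Forsythe variant; the F-test branch itself is shown in the later steps.
import numpy as np
from scipy.stats import shapiro, levene

# Synthetic stand-ins for the train/test CV score samples
rng = np.random.default_rng(1)
train_scores = rng.normal(loc=0.92, scale=0.02, size=10)
test_scores = rng.normal(loc=0.90, scale=0.03, size=10)

alpha = 0.05
# Normal-looking samples support the F-test; otherwise fall back to a robust test
if shapiro(train_scores).pvalue > alpha and shapiro(test_scores).pvalue > alpha:
    print("Both samples look normal: use the F-test for equality of variances")
else:
    # center='mean' gives the classical Levene test; center='median' is Brown-Forsythe
    stat, p = levene(train_scores, test_scores, center="median")
    print(f"Brown-Forsythe test: stat={stat:.3f}, p={p:.3f}")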
Verify Assumptions for the Test
- What to do: Ensure the chosen test’s assumptions are met (see the sketch below):
- For the F-test, assess normality of the performance score samples using a normality test (e.g., Shapiro-Wilk or Kolmogorov-Smirnov).
- For Levene’s or Brown-Forsythe tests, confirm the independence of the samples and review sample size requirements.
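A short sketch of the normality check, again on synthetic stand-in scores; the Shapiro-Wilk test is used here, and p-values above the significance level are taken as consistent with normality.
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-ins for the two CV score samples
rng = np.random.default_rng(42)
train_scores = rng.normal(loc=0.92, scale=0.02, size=10)
test_scores = rng.normal(loc=0.90, scale=0.03, size=10)

alpha = 0.05
for name, scores in [("train", train_scores), ("test", test_scores)]:
    stat, p = shapiro(scores)
    verdict = "looks normal" if p > alpha else "non-normal"
    print(f"{name}: W={stat:.3f}, p={p:.3f} -> {verdict}")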
Perform the Variance Test
- What to do: Compare the variances of the train and test performance score samples (see the sketch below):
- Input the two samples into the selected test.
- Record the test statistic and p-value.
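For the F-test branch, a minimal sketch (with synthetic stand-in scores) computes the ratio of sample variances and refers it to an F distribution for a two-sided p-value; Levene’s or the Brown-Forsythe test would replace this block when normality is doubtful.
import numpy as np
from scipy.stats import f as f_dist

# Synthetic stand-ins for the two CV score samples
rng = np.random.default_rng(0)
train_scores = rng.normal(loc=0.92, scale=0.02, size=10)
test_scores = rng.normal(loc=0.90, scale=0.03, size=10)

# F statistic: ratio of the two sample variances
stat = np.var(train_scores, ddof=1) / np.var(test_scores, ddof=1)
dfn, dfd = len(train_scores) - 1, len(test_scores) - 1
# Two-sided p-value from the F distribution
p_value = 2 * min(f_dist.cdf(stat, dfn, dfd), f_dist.sf(stat, dfn, dfd))
print(f"F={stat:.3f}, p={p_value:.3f}")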
Interpret the Results
- What to do: Analyze the output of the variance test (see the sketch below):
- If the p-value is less than the significance level (commonly 0.05), conclude that the variances are significantly different.
- If the p-value is greater than the significance level, conclude that there is no significant difference in variances.
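The decision rule reduces to a comparison against the significance level; in the sketch below, p_value is a placeholder standing in for the result of whichever test was run.
alpha = 0.05
p_value = 0.217  # placeholder carried over from the chosen variance test

if p_value < alpha:
    print("Variances differ significantly between train and test scores")
else:
    print("No significant difference in variances")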
Report the Results
- What to do: Summarize the findings:
- Clearly state the test used, the test statistic, and the p-value.
- Explain the practical implications for the model’s performance stability across the train and test sets (e.g., variance inconsistency might suggest overfitting or instability).
Code Example
This function tests whether the variances of performance scores on the train and test sets are statistically different. It automatically selects the appropriate test based on the normality of the score distributions, using a chosen model, scoring metric, and k-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import shapiro, levene, f as f_dist


def variance_diagnostic(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Perform a diagnostic test to compare variances of train and test CV scores.

    Parameters:
        X_train, y_train: Train set X and y elements.
        X_test, y_test: Test set X and y elements.
        model: The machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        alpha (float): P-value cut-off for statistical tests (default=0.05).

    Returns:
        dict: Dictionary containing the test name, statistic, p-value, significance, and interpretation.
    """
    # Cross-validation scores for the train set
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    # Cross-validation scores for the test set
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Check normality of the score distributions
    _, p_train_normal = shapiro(cv_train_scores)
    _, p_test_normal = shapiro(cv_test_scores)

    if p_train_normal > alpha and p_test_normal > alpha:
        # Both distributions look normal: use the F-test for equality of variances,
        # i.e., the ratio of sample variances referred to an F distribution
        var_train = np.var(cv_train_scores, ddof=1)
        var_test = np.var(cv_test_scores, ddof=1)
        stat = var_train / var_test
        dfn, dfd = len(cv_train_scores) - 1, len(cv_test_scores) - 1
        # Two-sided p-value for the variance-ratio F-test
        p_value = 2 * min(f_dist.cdf(stat, dfn, dfd), f_dist.sf(stat, dfn, dfd))
        test_name = "F-Test"
    else:
        # Use Levene's test if normality is not met
        stat, p_value = levene(cv_train_scores, cv_test_scores)
        test_name = "Levene's Test"

    # Interpret results
    significance = "significant" if p_value <= alpha else "not significant"
    interpretation = "Variances are different" if p_value <= alpha else "No evidence of variance difference"

    return {
        "Test Name": test_name,
        "Statistic": stat,
        "P-Value": p_value,
        "Significance": significance,
        "Interpretation": interpretation,
    }
# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the variance diagnostic test
    results = variance_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        10
    )

    # Print Results
    print("Variance Diagnostic Test Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")

Example Output
Variance Diagnostic Test Results:
- Test Name: F-Test
- Statistic: 1.638202247191011
- P-Value: 0.2168230881706279
- Significance: not significant
- Interpretation: No evidence of variance difference