Performance Correlation

Procedure

Assumes a chosen model and performance metric, and that the same k-fold cross-validation procedure is applied to both sets so the resulting scores are directly comparable (an apples-to-apples comparison).

  1. Evaluate Performance on Train and Test Sets

    • What to do: Apply the same k-fold cross-validation to the train and test sets and gather performance scores.
      • Collect a sample of performance scores for the train set.
      • Collect a sample of performance scores for the test set.
  2. Check for Correlation Between Performance Scores

    • What to do: Analyze the relationship between the train-set and test-set performance scores:
      • Compute Pearson’s correlation coefficient if both score samples are approximately normally distributed and the relationship is expected to be linear.
      • Compute Spearman’s rank correlation coefficient if either sample departs from normality, or if the relationship is monotonic but non-linear (a short contrast of the two coefficients follows this list).
  3. Interpret the Results

    • What to do: Determine if the model’s train and test performance scores are highly correlated:
      • If the correlation is strong and statistically significant, the model generalizes well across datasets.
      • If the correlation is weak or non-significant, investigate overfitting, data issues, or inconsistencies in the evaluation procedure.
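
To make the Pearson/Spearman distinction in step 2 concrete, here is a minimal, purely illustrative sketch (separate from the diagnostic below) contrasting the two coefficients on synthetic data with a monotonic but non-linear relationship:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic monotonic, non-linear relationship: y grows exponentially with x
rng = np.random.default_rng(42)
x = np.linspace(0, 5, 30)
y = np.exp(x) + rng.normal(scale=0.1, size=x.shape)

# Pearson measures linear association, so it understates the relationship here
pearson_corr, _ = pearsonr(x, y)
# Spearman measures monotonic association via ranks, so it stays near 1.0
spearman_corr, _ = spearmanr(x, y)

print(f"Pearson:  {pearson_corr:.3f}")   # noticeably below 1.0
print(f"Spearman: {spearman_corr:.3f}")  # close to 1.0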

Code Example

Below is a Python function that computes the correlation between the cross-validation scores on the train set and the test set, uses a Shapiro–Wilk normality check on each score sample to choose between Pearson and Spearman, and interprets the resulting coefficient.

import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import shapiro, pearsonr, spearmanr
from sklearn.metrics import make_scorer

def correlation_diagnostic(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Perform a diagnostic test to calculate the correlation between train and test CV scores.

    Parameters:
        X_train, y_train: Train set X and y elements.
        X_test, y_test: Test set X and y elements.
        model: The machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        alpha (float): P-value cut-off for statistical tests (default=0.05).

    Returns:
        dict: Dictionary containing correlation, p-value, significance, and interpretation.
    """
    # k-fold cross-validation scores on the train set (one score per fold)
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    # k-fold cross-validation scores on the test set, paired with the train scores by fold index
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Check normality of distributions
    _, p_train_normal = shapiro(cv_train_scores)
    _, p_test_normal = shapiro(cv_test_scores)

    # Determine correlation method
    if p_train_normal > alpha and p_test_normal > alpha:
        # Both distributions are normal, use Pearson correlation
        corr, p_value = pearsonr(cv_train_scores, cv_test_scores)
        method = "Pearson"
    else:
        # At least one distribution is not normal, use Spearman correlation
        corr, p_value = spearmanr(cv_train_scores, cv_test_scores)
        method = "Spearman"

    # Interpretation
    significance = "significant" if p_value <= alpha else "not significant"

    # Interpret correlation strength
    if corr >= 0.8:
        interpretation = "Strongly Correlated"
    elif corr >= 0.5:
        interpretation = "Moderately Correlated"
    elif corr >= 0.3:
        interpretation = "Weakly Correlated"
    else:
        interpretation = "Uncorrelated or Negatively Correlated"

    return {
        "Method": method,
        "Correlation": corr,
        "P-Value": p_value,
        "Significance": significance,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the correlation diagnostic test
    results = correlation_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        10
    )

    # Print Results
    print("Correlation Diagnostic Test Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Correlation Diagnostic Test Results:
  - Method: Pearson
  - Correlation: 0.2670640334591372
  - P-Value: 0.45571180013484286
  - Significance: not significant
  - Interpretation: Uncorrelated or Negatively Correlated
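
As a complementary visual check (not part of the procedure above), plotting the paired fold scores can hint at whether a weak coefficient reflects genuine decoupling or simply noise from having only k paired points. This sketch assumes the same synthetic data and model as the demo and recomputes the fold scores directly:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Recreate the demo data and model from above
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = RandomForestClassifier(random_state=42)

# One accuracy score per fold for each set; fold index pairs the points
train_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
test_scores = cross_val_score(model, X_test, y_test, cv=10, scoring="accuracy")

plt.scatter(train_scores, test_scores)
plt.xlabel("Train-set CV accuracy (per fold)")
plt.ylabel("Test-set CV accuracy (per fold)")
plt.title("Paired fold scores: train vs. test")
plt.show()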