Performance Correlation
Procedure
This procedure assumes a chosen model and performance metric, and that the same k-fold cross-validation is applied to both sets so the comparison is apples-to-apples.
Evaluate Performance on Train and Test Sets
- What to do: Apply the same k-fold cross-validation to the train and test sets and gather performance scores (a short sketch follows this list).
- Collect a sample of performance scores for the train set.
- Collect a sample of performance scores for the test set.
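For illustration, a minimal sketch of this step is shown below; the synthetic dataset, the DecisionTreeClassifier, and accuracy as the metric are placeholder assumptions, not part of the procedure itself.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data, model, and metric (assumptions for this sketch only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = DecisionTreeClassifier(random_state=1)
k = 10

# Apply the identical k-fold procedure to each set, giving one sample of k scores per set
train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring="accuracy")
test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring="accuracy")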
Check for Correlation Between Performance Scores
- What to do: Analyze the relationship between the performance scores from the train and test sets (see the sketch after this list):
- Compute Pearson’s correlation coefficient if the scores are linearly related.
- Compute Spearman’s rank correlation coefficient if the relationship is non-linear or if the scores are ordinal.
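A minimal sketch of this step, continuing from the train_scores and test_scores arrays gathered above, might use a Shapiro-Wilk normality check to decide between the two coefficients (the 0.05 threshold is an assumption):

from scipy.stats import shapiro, pearsonr, spearmanr

alpha = 0.05  # assumed significance level for the normality test

# Shapiro-Wilk test on each sample of fold scores
_, p_train_normal = shapiro(train_scores)
_, p_test_normal = shapiro(test_scores)

if p_train_normal > alpha and p_test_normal > alpha:
    # Both samples look Gaussian: Pearson's linear correlation
    corr, p_value = pearsonr(train_scores, test_scores)
else:
    # Otherwise fall back to Spearman's rank correlation
    corr, p_value = spearmanr(train_scores, test_scores)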
Interpret the Results
- What to do: Determine whether the model’s train and test performance scores are highly correlated (see the sketch after this list):
- If the correlation is strong and statistically significant, the model generalizes well across datasets.
- If the correlation is weak or non-significant, investigate overfitting, data issues, or inconsistencies in the evaluation procedure.
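As a rough sketch of this decision, continuing from the corr and p_value computed above (the 0.8 cut-off mirrors the interpretation used in the full example below and is otherwise an assumption):

alpha = 0.05  # assumed significance level

if p_value <= alpha and corr >= 0.8:
    print("Strong, significant correlation: performance generalizes across train and test.")
else:
    print("Weak or non-significant correlation: investigate overfitting, data issues, or the evaluation setup.")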
Code Example
Below is a Python function that calculates the correlation between the cross-validation scores on the train set and on the test set, checks the normality of the score distributions to choose between Pearson’s and Spearman’s coefficients, and interprets the correlation result.
import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import shapiro, pearsonr, spearmanr
from sklearn.metrics import make_scorer

def correlation_diagnostic(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Perform a diagnostic test to calculate the correlation between train and test CV scores.

    Parameters:
        X_train, y_train: Train set X and y elements.
        X_test, y_test: Test set X and y elements.
        model: The machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        alpha (float): P-value cut-off for statistical tests (default=0.05).

    Returns:
        dict: Dictionary containing correlation, p-value, significance, and interpretation.
    """
    # Cross-validation scores for the train set
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))

    # Cross-validation scores for the test set
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Check normality of distributions
    _, p_train_normal = shapiro(cv_train_scores)
    _, p_test_normal = shapiro(cv_test_scores)

    # Determine correlation method
    if p_train_normal > alpha and p_test_normal > alpha:
        # Both distributions are normal, use Pearson correlation
        corr, p_value = pearsonr(cv_train_scores, cv_test_scores)
        method = "Pearson"
    else:
        # At least one distribution is not normal, use Spearman correlation
        corr, p_value = spearmanr(cv_train_scores, cv_test_scores)
        method = "Spearman"

    # Statistical significance of the correlation
    significance = "significant" if p_value <= alpha else "not significant (insufficient evidence)"

    # Interpret correlation strength
    if corr >= 0.8:
        interpretation = "Strongly Correlated"
    elif corr >= 0.5:
        interpretation = "Moderately Correlated"
    elif corr >= 0.3:
        interpretation = "Weakly Correlated"
    else:
        interpretation = "Uncorrelated or Negatively Correlated"

    return {
        "Method": method,
        "Correlation": corr,
        "P-Value": p_value,
        "Significance": significance,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the correlation diagnostic test
    results = correlation_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        10
    )

    # Print Results
    print("Correlation Diagnostic Test Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")
Example Output
Correlation Diagnostic Test Results:
- Method: Pearson
- Correlation: 0.2670640334591372
- P-Value: 0.45571180013484286
- Significance: not significant (insufficient evidence)
- Interpretation: Uncorrelated or Negatively Correlated