Check Score Correlation
Assumptions
- Predefined Dataset Split: The dataset has already been split into training and test sets; this check runs cross-validation on the training portion only.
- Model and Metric: A single machine learning model and performance metric (e.g., accuracy, F1 score, RMSE) are defined for evaluation.
Procedure
- Perform 10-Fold Cross-Validation
  - What to do: Split the training dataset into 10 folds, using 9 folds for training and 1 fold for validation in each iteration.
  - Data Collection: Record the performance metric scores for the model on both the train and validation sets for each fold (see the sketch below).
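A minimal sketch of this step, assuming a scikit-learn estimator and synthetic placeholder data (the estimator, dataset, and scoring name are illustrative choices, not fixed by this check):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # placeholder data
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),  # any estimator could be used here
    X, y,
    cv=10,                     # 10 folds: 9 for training, 1 for validation per iteration
    scoring="accuracy",        # swap for the metric defined in the Assumptions
    return_train_score=True,   # also record the score on the 9 training folds
)
train_scores = cv_results["train_score"]  # one training score per fold
val_scores = cv_results["test_score"]     # one held-out-fold score per iteration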
- Check Distribution Normality of Scores
  - What to do: Use statistical tests and visualizations to assess if the distribution of performance scores is normal (see the sketch below):
    - Apply the Shapiro-Wilk test.
    - Inspect histograms or Q-Q plots for a visual assessment of normality.
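A minimal sketch of the normality check, assuming the per-fold train_scores and val_scores arrays collected in the previous sketch:

import matplotlib.pyplot as plt
from scipy import stats

# Shapiro-Wilk: a p-value above 0.05 is consistent with normality
_, p_train = stats.shapiro(train_scores)
_, p_val = stats.shapiro(val_scores)
print(f"Shapiro-Wilk p-values -- train: {p_train:.3f}, validation: {p_val:.3f}")

# Visual check: Q-Q plot of the validation scores against a normal distribution
stats.probplot(val_scores, dist="norm", plot=plt)
plt.title("Q-Q plot of validation scores")
plt.show()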
- Calculate Correlation Between Train and Validation Scores
  - What to do: Use the appropriate correlation measure based on the distribution of the scores (the sketch after the next step combines this choice with the significance test):
    - If scores are normally distributed, calculate the Pearson correlation coefficient.
    - If scores are non-normal, use the Spearman rank correlation coefficient.
- Test Correlation Significance
  - What to do: Perform a hypothesis test to assess the significance of the correlation (see the sketch below):
    - Null Hypothesis: There is no correlation between train and validation scores.
    - Report the p-value to determine whether the observed correlation is statistically significant.
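A minimal sketch covering this step and the previous one, assuming the score arrays and Shapiro-Wilk p-values from the sketches above and a significance level of 0.05:

from scipy.stats import pearsonr, spearmanr

alpha = 0.05
if p_train > alpha and p_val > alpha:
    # Both score distributions look normal: Pearson is appropriate
    corr_coef, p_value = pearsonr(train_scores, val_scores)
    method = "Pearson"
else:
    # Otherwise fall back to the rank-based Spearman coefficient
    corr_coef, p_value = spearmanr(train_scores, val_scores)
    method = "Spearman"

# Both pearsonr and spearmanr test the null hypothesis of no correlation
print(f"{method} r = {corr_coef:.3f}, p = {p_value:.4f}")
print("Significant" if p_value <= alpha else "Not significant", "at the 5% level")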
- Report Results
  - What to do: Present findings in a clear table (see the sketch below):
    - Include fold numbers, train scores, validation scores, the correlation coefficient, and the p-value.
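A minimal sketch of the reporting table, assuming pandas is available and reusing the per-fold arrays and correlation results from the sketches above:

import pandas as pd

report = pd.DataFrame({
    "fold": range(1, len(train_scores) + 1),
    "train_score": train_scores,
    "val_score": val_scores,
})
print(report.to_string(index=False))
print(f"{method} correlation: r = {corr_coef:.3f}, p-value = {p_value:.4f}")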
Interpretation
Outcome
- Results Provided:
  - Correlation coefficient indicating the strength and direction of the relationship between train and validation scores.
  - Statistical test results (p-value) to confirm significance.
Healthy/Problematic
- Healthy Correlation:
  - High positive correlation (e.g., r > 0.8) with a statistically significant p-value (e.g., p ≤ 0.05) indicates a consistent relationship between train and validation scores.
  - Such a correlation suggests the model generalizes well across training and validation data.
- Problematic Correlation:
  - Weak or negative correlation (e.g., r < 0.5) or a non-significant p-value (e.g., p > 0.05) suggests inconsistency or potential issues in the training or validation data splits.
  - Large discrepancies in score distributions may indicate overfitting.
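The rules of thumb above can be wrapped in a small labelling helper; a minimal sketch (the cut-offs r > 0.8, r < 0.5, and p = 0.05 come from the thresholds quoted here, not from a statistical standard):

def label_correlation(r, p, alpha=0.05):
    """Classify a train/validation score correlation using the rules of thumb above."""
    if r > 0.8 and p <= alpha:
        return "Healthy: strong, significant positive correlation"
    if r < 0.5 or p > alpha:
        return "Problematic: weak/negative or non-significant correlation"
    return "Borderline: inspect the fold-level scores"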
Limitations
- Data Size Sensitivity: Small datasets may lead to unreliable correlations due to limited variability in scores.
- Metric-Specific: Results are tied to the chosen performance metric; other metrics may reveal different relationships.
- Assumption of Consistency: Assumes that folds are representative of the overall data distribution; unbalanced folds may bias results.
Code Example
This function performs k-fold cross-validation on the training dataset, collects training and validation scores, and evaluates the correlation between them.
import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import pearsonr, spearmanr, shapiro
def cross_val_score_correlation(train_X, train_y, model, metric, k=10):
    """
    Performs k-fold cross-validation on the training data, collects train and validation scores,
    and computes the correlation between them.

    Parameters:
    - train_X: Numpy array of training input features.
    - train_y: Numpy array of training target values.
    - model: An sklearn estimator (e.g., LogisticRegression(), RandomForestRegressor()).
    - metric: Performance metric function (e.g., accuracy_score, r2_score).
    - k: Number of folds for cross-validation (default is 10).

    Returns:
    - results: Dictionary containing correlation coefficient, p-value, correlation method, and interpretation.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_scores = []
    val_scores = []

    for train_index, val_index in kf.split(train_X):
        X_train_fold, X_val_fold = train_X[train_index], train_X[val_index]
        y_train_fold, y_val_fold = train_y[train_index], train_y[val_index]

        model.fit(X_train_fold, y_train_fold)
        y_train_pred = model.predict(X_train_fold)
        y_val_pred = model.predict(X_val_fold)

        train_score = metric(y_train_fold, y_train_pred)
        val_score = metric(y_val_fold, y_val_pred)
        train_scores.append(train_score)
        val_scores.append(val_score)

    train_scores = np.array(train_scores)
    val_scores = np.array(val_scores)

    # Check for normality
    _, p_train = shapiro(train_scores)
    _, p_val = shapiro(val_scores)
    alpha = 0.05

    if p_train > alpha and p_val > alpha:
        # Use Pearson correlation
        corr_coef, p_value = pearsonr(train_scores, val_scores)
        corr_method = 'Pearson'
    else:
        # Use Spearman correlation
        corr_coef, p_value = spearmanr(train_scores, val_scores)
        corr_method = 'Spearman'

    # Interpret the correlation
    if abs(corr_coef) > 0.8:
        interpretation = 'Strong correlation'
    elif abs(corr_coef) > 0.5:
        interpretation = 'Moderate correlation'
    else:
        interpretation = 'Weak correlation'

    results = {
        'correlation_coefficient': corr_coef,
        'p_value': p_value,
        'correlation_method': corr_method,
        'interpretation': interpretation
    }
    return results
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, r2_score
# Classification example
# Generate synthetic classification data
X_class, y_class = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)
# Initialize model and metric
model_class = LogisticRegression(max_iter=1000)
metric_class = accuracy_score
# Perform the diagnostic test
results_class = cross_val_score_correlation(X_class, y_class, model_class, metric_class, k=10)
# Print the results
print("Classification Results:")
print(f"Correlation Coefficient: {results_class['correlation_coefficient']:.4f}")
print(f"P-value: {results_class['p_value']:.4f}")
print(f"Correlation Method: {results_class['correlation_method']}")
print(f"Interpretation: {results_class['interpretation']}")
print("-" * 40)
# Regression example
# Generate synthetic regression data
X_reg, y_reg = make_regression(n_samples=500, n_features=20, n_informative=15, noise=0.1, random_state=42)
# Initialize model and metric
model_reg = LinearRegression()
metric_reg = r2_score
# Perform the diagnostic test
results_reg = cross_val_score_correlation(X_reg, y_reg, model_reg, metric_reg, k=10)
# Print the results
print("Regression Results:")
print(f"Correlation Coefficient: {results_reg['correlation_coefficient']:.4f}")
print(f"P-value: {results_reg['p_value']:.4f}")
print(f"Correlation Method: {results_reg['correlation_method']}")
print(f"Interpretation: {results_reg['interpretation']}")
Example Output
Classification Results:
Correlation Coefficient: -0.6285
P-value: 0.0516
Correlation Method: Spearman
Interpretation: Moderate correlation
----------------------------------------
Regression Results:
Correlation Coefficient: -0.9118
P-value: 0.0002
Correlation Method: Pearson
Interpretation: Strong correlation
Key Features
- K-Fold Cross-Validation: Performs k-fold cross-validation to collect training and validation performance scores.
- Dynamic Normality Testing: Utilizes the Shapiro-Wilk test to determine if score distributions are normal.
- Appropriate Correlation Analysis: Selects Pearson or Spearman correlation based on the normality of the data.
- Interpretation of Results: Provides a clear, concise interpretation of the correlation strength between training and validation scores.
- Flexible Application: Supports both classification and regression models with customizable metrics.
- User-Friendly Output: Returns results in a dictionary format for easy access and reporting.