Check Score Correlation
Assumptions
- Predefined Dataset Split: The dataset has already been split into training and test sets; this check runs cross-validation on the training portion only.
- Model and Metric: A single machine learning model and performance metric (e.g., accuracy, F1 score, RMSE) are defined for evaluation.
Procedure
- Perform 10-Fold Cross-Validation
  - What to do: Split the training dataset into 10 folds, using 9 folds for training and 1 fold for validation in each iteration.
  - Data Collection: Record the performance metric scores for the model on both the train and validation sets for each fold (see the sketch below).
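A minimal sketch of this step, assuming a scikit-learn estimator and synthetic placeholder data (the estimator, dataset, and scoring name are illustrative choices, not fixed by this check):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # placeholder data
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),  # any estimator could be used here
    X, y,
    cv=10,                     # 10 folds: 9 for training, 1 for validation per iteration
    scoring="accuracy",        # swap for the metric defined in the Assumptions
    return_train_score=True,   # also record the score on the 9 training folds
)
train_scores = cv_results["train_score"]  # one training score per fold
val_scores = cv_results["test_score"]     # one held-out-fold score per iteration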
- Check Distribution Normality of Scores
  - What to do: Use statistical tests and visualizations to assess if the distribution of performance scores is normal (see the sketch below):
    - Apply the Shapiro-Wilk test.
    - Inspect histograms or Q-Q plots for a visual assessment of normality.
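A minimal sketch of the normality check, assuming the per-fold train_scores and val_scores arrays collected in the previous sketch:

import matplotlib.pyplot as plt
from scipy import stats

# Shapiro-Wilk: a p-value above 0.05 is consistent with normality
_, p_train = stats.shapiro(train_scores)
_, p_val = stats.shapiro(val_scores)
print(f"Shapiro-Wilk p-values -- train: {p_train:.3f}, validation: {p_val:.3f}")

# Visual check: Q-Q plot of the validation scores against a normal distribution
stats.probplot(val_scores, dist="norm", plot=plt)
plt.title("Q-Q plot of validation scores")
plt.show()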
- Calculate Correlation Between Train and Validation Scores
  - What to do: Use the appropriate correlation measure based on the distribution of the scores (the sketch after the next step combines this choice with the significance test):
    - If scores are normally distributed, calculate the Pearson correlation coefficient.
    - If scores are non-normal, use the Spearman rank correlation coefficient.
- Test Correlation Significance
  - What to do: Perform a hypothesis test to assess the significance of the correlation (see the sketch below):
    - Null Hypothesis: There is no correlation between train and validation scores.
    - Report the p-value to determine whether the observed correlation is statistically significant.
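A minimal sketch covering this step and the previous one, assuming the score arrays and Shapiro-Wilk p-values from the sketches above and a significance level of 0.05:

from scipy.stats import pearsonr, spearmanr

alpha = 0.05
if p_train > alpha and p_val > alpha:
    # Both score distributions look normal: Pearson is appropriate
    corr_coef, p_value = pearsonr(train_scores, val_scores)
    method = "Pearson"
else:
    # Otherwise fall back to the rank-based Spearman coefficient
    corr_coef, p_value = spearmanr(train_scores, val_scores)
    method = "Spearman"

# Both pearsonr and spearmanr test the null hypothesis of no correlation
print(f"{method} r = {corr_coef:.3f}, p = {p_value:.4f}")
print("Significant" if p_value <= alpha else "Not significant", "at the 5% level")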
- Report Results
  - What to do: Present findings in a clear table (see the sketch below):
    - Include fold numbers, train scores, validation scores, the correlation coefficient, and the p-value.
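A minimal sketch of the reporting table, assuming pandas is available and reusing the per-fold arrays and correlation results from the sketches above:

import pandas as pd

report = pd.DataFrame({
    "fold": range(1, len(train_scores) + 1),
    "train_score": train_scores,
    "val_score": val_scores,
})
print(report.to_string(index=False))
print(f"{method} correlation: r = {corr_coef:.3f}, p-value = {p_value:.4f}")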
Interpretation
Outcome
- Results Provided:
  - Correlation coefficient indicating the strength and direction of the relationship between train and validation scores.
  - Statistical test results (p-value) to confirm significance.
Healthy/Problematic
- Healthy Correlation:
  - High positive correlation (e.g., r > 0.8) with a statistically significant p-value (e.g., p ≤ 0.05) indicates a consistent relationship between train and validation scores.
  - Such a correlation suggests the model generalizes well across training and validation data.
- Problematic Correlation:
  - Weak or negative correlation (e.g., r < 0.5) or a non-significant p-value (e.g., p > 0.05) suggests inconsistency or potential issues in the training or validation data splits.
  - Large discrepancies in score distributions may indicate overfitting.
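The rules of thumb above can be wrapped in a small labelling helper; a minimal sketch (the cut-offs r > 0.8, r < 0.5, and p = 0.05 come from the thresholds quoted here, not from a statistical standard):

def label_correlation(r, p, alpha=0.05):
    """Classify a train/validation score correlation using the rules of thumb above."""
    if r > 0.8 and p <= alpha:
        return "Healthy: strong, significant positive correlation"
    if r < 0.5 or p > alpha:
        return "Problematic: weak/negative or non-significant correlation"
    return "Borderline: inspect the fold-level scores"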
Limitations
- Data Size Sensitivity: Small datasets may lead to unreliable correlations due to limited variability in scores.
- Metric-Specific: Results are tied to the chosen performance metric; other metrics may reveal different relationships.
- Assumption of Consistency: Assumes that folds are representative of the overall data distribution; unbalanced folds may bias results.
Code Example
This function performs k-fold cross-validation on the training dataset, collects training and validation scores, and evaluates the correlation between them.
import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import pearsonr, spearmanr, shapiro
def cross_val_score_correlation(train_X, train_y, model, metric, k=10):
    """
    Performs k-fold cross-validation on the training data, collects train and validation scores,
    and computes the correlation between them.

    Parameters:
    - train_X: Numpy array of training input features.
    - train_y: Numpy array of training target values.
    - model: An sklearn estimator (e.g., LogisticRegression(), RandomForestRegressor()).
    - metric: Performance metric function (e.g., accuracy_score, r2_score).
    - k: Number of folds for cross-validation (default is 10).

    Returns:
    - results: Dictionary containing correlation coefficient, p-value, correlation method, and interpretation.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_scores = []
    val_scores = []

    for train_index, val_index in kf.split(train_X):
        X_train_fold, X_val_fold = train_X[train_index], train_X[val_index]
        y_train_fold, y_val_fold = train_y[train_index], train_y[val_index]

        model.fit(X_train_fold, y_train_fold)
        y_train_pred = model.predict(X_train_fold)
        y_val_pred = model.predict(X_val_fold)

        train_score = metric(y_train_fold, y_train_pred)
        val_score = metric(y_val_fold, y_val_pred)
        train_scores.append(train_score)
        val_scores.append(val_score)

    train_scores = np.array(train_scores)
    val_scores = np.array(val_scores)

    # Check for normality
    _, p_train = shapiro(train_scores)
    _, p_val = shapiro(val_scores)
    alpha = 0.05

    if p_train > alpha and p_val > alpha:
        # Use Pearson correlation
        corr_coef, p_value = pearsonr(train_scores, val_scores)
        corr_method = 'Pearson'
    else:
        # Use Spearman correlation
        corr_coef, p_value = spearmanr(train_scores, val_scores)
        corr_method = 'Spearman'

    # Interpret the correlation
    if abs(corr_coef) > 0.8:
        interpretation = 'Strong correlation'
    elif abs(corr_coef) > 0.5:
        interpretation = 'Moderate correlation'
    else:
        interpretation = 'Weak correlation'

    results = {
        'correlation_coefficient': corr_coef,
        'p_value': p_value,
        'correlation_method': corr_method,
        'interpretation': interpretation
    }
    return results
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, r2_score
# Classification example
# Generate synthetic classification data
X_class, y_class = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)
# Initialize model and metric
model_class = LogisticRegression(max_iter=1000)
metric_class = accuracy_score
# Perform the diagnostic test
results_class = cross_val_score_correlation(X_class, y_class, model_class, metric_class, k=10)
# Print the results
print("Classification Results:")
print(f"Correlation Coefficient: {results_class['correlation_coefficient']:.4f}")
print(f"P-value: {results_class['p_value']:.4f}")
print(f"Correlation Method: {results_class['correlation_method']}")
print(f"Interpretation: {results_class['interpretation']}")
print("-" * 40)
# Regression example
# Generate synthetic regression data
X_reg, y_reg = make_regression(n_samples=500, n_features=20, n_informative=15, noise=0.1, random_state=42)
# Initialize model and metric
model_reg = LinearRegression()
metric_reg = r2_score
# Perform the diagnostic test
results_reg = cross_val_score_correlation(X_reg, y_reg, model_reg, metric_reg, k=10)
# Print the results
print("Regression Results:")
print(f"Correlation Coefficient: {results_reg['correlation_coefficient']:.4f}")
print(f"P-value: {results_reg['p_value']:.4f}")
print(f"Correlation Method: {results_reg['correlation_method']}")
print(f"Interpretation: {results_reg['interpretation']}")
Example Output
Classification Results:
Correlation Coefficient: -0.6285
P-value: 0.0516
Correlation Method: Spearman
Interpretation: Moderate correlation
----------------------------------------
Regression Results:
Correlation Coefficient: -0.9118
P-value: 0.0002
Correlation Method: Pearson
Interpretation: Strong correlation
Key Features
- K-Fold Cross-Validation: Performs k-fold cross-validation to collect training and validation performance scores.
- Dynamic Normality Testing: Utilizes the Shapiro-Wilk test to determine if score distributions are normal.
- Appropriate Correlation Analysis: Selects Pearson or Spearman correlation based on the normality of the data.
- Interpretation of Results: Provides a clear, concise interpretation of the correlation strength between training and validation scores.
- Flexible Application: Supports both classification and regression models with customizable metrics.
- User-Friendly Output: Returns results in a dictionary format for easy access and reporting.