Compare Score Central Tendencies

Assumptions

  1. Predefined Dataset Split: The dataset is already split into train and test sets.
  2. Model and Metric: A single machine learning model and performance metric (e.g., accuracy, F1 score, RMSE) are defined for evaluation.
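
The sketch below is one minimal way to satisfy these assumptions; the synthetic data, LogisticRegression model, and accuracy metric are illustrative placeholders, not requirements.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumption 1: a predefined train/test split (here produced with train_test_split)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumption 2: a single model and a single performance metric chosen up front
model = LogisticRegression(max_iter=1000)
metric = accuracy_score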

Procedure

  1. Perform 10-Fold Cross-Validation

    • What to do: Split the training dataset into 10 folds. For each fold:
      • Train the model on 9 folds.
      • Evaluate performance on the held-out fold.
    • Data Collection: Gather the performance scores for both the train folds and the held-out (validation) folds across all iterations (a code sketch condensing steps 1-3 follows this list).
  2. Check Normality of Score Distributions

    • What to do: Use statistical tests and visualizations to assess if train and test fold performance score distributions are normal:
      • Apply the Shapiro-Wilk test.
      • Generate histograms or Q-Q plots for visual confirmation of normality.
  3. Compare Central Tendencies

    • What to do: Select the hypothesis test based on the normality of score distributions:
      • If normal, use a paired t-test to compare mean scores.
      • If non-normal, use the Wilcoxon signed-rank test to compare median scores.
  4. Calculate the Absolute Difference in Central Tendencies

    • What to do: Measure the absolute difference between the central tendencies (mean or median) of train and test fold scores.
    • Record this difference to assess the gap between training and testing performance.
  5. Interpret and Report Results

    • What to do: Present findings in a clear table:
      • Include fold numbers, train fold scores, held-out fold scores, the p-value from the hypothesis test, and the absolute difference in central tendencies.
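
As referenced in step 1, the sketch below condenses steps 1-3 using scikit-learn's cross_validate (with return_train_score=True) and scipy.stats; the data and model are placeholders, and the full, reusable implementation appears in the Code Example section.

from scipy.stats import shapiro, ttest_rel, wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Placeholder data and model
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Step 1: 10-fold cross-validation, collecting both train-fold and held-out-fold scores
cv = cross_validate(model, X, y, cv=10, scoring="accuracy", return_train_score=True)
train_scores, val_scores = cv["train_score"], cv["test_score"]

# Step 2: Shapiro-Wilk normality check on each score distribution
both_normal = shapiro(train_scores).pvalue > 0.05 and shapiro(val_scores).pvalue > 0.05

# Step 3: paired t-test if both distributions look normal, otherwise Wilcoxon signed-rank test
stat, p_value = ttest_rel(train_scores, val_scores) if both_normal else wilcoxon(train_scores, val_scores)
print(f"p-value: {p_value:.4f}")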

Interpretation

Outcome

  • Results Provided:
    • Central tendencies (mean or median) of train and test fold scores.
    • p-value from the hypothesis test indicating whether the difference is statistically significant.
    • Absolute difference in central tendencies.

Healthy/Problematic

  • Healthy Central Tendencies:
    • Small absolute differences in central tendencies with non-significant p-values (e.g., p > 0.05) indicate consistent performance between train and test folds.
    • Suggests the model generalizes well without significant overfitting.
  • Problematic Central Tendencies:
    • Large absolute differences or significant p-values (e.g., p ≤ 0.05) suggest potential overfitting or an inability to generalize.
    • Train fold performance significantly exceeding test fold performance indicates overfitting.

Limitations

  • Model and Metric Dependence: Results are specific to the chosen model and performance metric; testing with multiple metrics may provide a fuller picture.
  • Dataset Size and Balance: Small or imbalanced datasets may result in unreliable results or biased folds.
  • Assumption of Fold Independence: Assumes that folds represent independent samples, which may not hold in cases of temporal or spatial correlation (a time-aware alternative is sketched below).
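
Where fold independence is questionable (for example, temporally ordered observations), a time-aware splitter can stand in for plain k-fold. The sketch below is one such variant, assuming synthetic placeholder data and a Ridge regressor; it only swaps the splitter, and the statistical comparison steps stay the same.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Placeholder temporally ordered data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)

train_scores, val_scores = [], []
# TimeSeriesSplit keeps each validation fold strictly after its training fold,
# which avoids leakage when observations are temporally correlated
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    train_scores.append(r2_score(y[train_idx], model.predict(X[train_idx])))
    val_scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))
# The collected scores can then feed the same normality check and paired test as above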

Code Example

This function performs k-fold cross-validation on the training data, collects training and validation fold scores, applies a normality check to choose a statistical test, compares their central tendencies, and interprets the result.

import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import shapiro, ttest_rel, wilcoxon

def compare_train_test_scores(train_X, train_y, model, metric, k=10):
    """
    Performs k-fold cross-validation on the training data, gathers train and validation scores,
    compares their central tendencies, and interprets the results.

    Parameters:
    - train_X: Numpy array of training input features.
    - train_y: Numpy array of training target values.
    - model: An sklearn estimator (e.g., LogisticRegression(), RandomForestRegressor()).
    - metric: Performance metric function (e.g., accuracy_score, r2_score).
    - k: Number of folds for cross-validation (default is 10).

    Returns:
    - results: Dictionary containing mean/median differences, p-value, test used, and an interpretation.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_scores = []
    val_scores = []

    for train_index, val_index in kf.split(train_X):
        X_train_fold, X_val_fold = train_X[train_index], train_X[val_index]
        y_train_fold, y_val_fold = train_y[train_index], train_y[val_index]

        model.fit(X_train_fold, y_train_fold)

        train_pred = model.predict(X_train_fold)
        val_pred = model.predict(X_val_fold)

        train_scores.append(metric(y_train_fold, train_pred))
        val_scores.append(metric(y_val_fold, val_pred))

    train_scores = np.array(train_scores)
    val_scores = np.array(val_scores)

    # Test for normality
    _, p_train = shapiro(train_scores)
    _, p_val = shapiro(val_scores)
    alpha = 0.05

    # Choose appropriate test
    if p_train > alpha and p_val > alpha:
        # Use paired t-test
        test_stat, p_value = ttest_rel(train_scores, val_scores)
        test_used = "Paired t-test"
    else:
        # Use Wilcoxon signed-rank test
        test_stat, p_value = wilcoxon(train_scores, val_scores)
        test_used = "Wilcoxon test"

    # Calculate mean and median differences
    mean_difference = np.abs(np.mean(train_scores) - np.mean(val_scores))
    median_difference = np.abs(np.median(train_scores) - np.median(val_scores))

    # Interpret results
    interpretation = "Healthy (Generalization OK)" if p_value > alpha else "Problematic (Potential overfitting)"

    results = {
        "mean_difference": mean_difference,
        "median_difference": median_difference,
        "p_value": p_value,
        "test_used": test_used,
        "interpretation": interpretation,
    }

    return results

# Demonstration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic classification data and apply a predefined train/test split (the test set is not used by this diagnostic)
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

# Initialize model and metric
model = LogisticRegression(max_iter=1000)
metric = accuracy_score

# Run the diagnostic test
results = compare_train_test_scores(X_train, y_train, model, metric, k=10)

# Print the results
print("Results of Train vs Validation Score Comparison:")
for key, value in results.items():
    print(f"{key.capitalize()}: {value:.4f}" if isinstance(value, float) else f"{key.capitalize()}: {value}")

Example Output

Results of Train vs Validation Score Comparison:
Mean_difference: 0.0543
Median_difference: 0.0478
P_value: 0.0135
Test_used: Wilcoxon test
Interpretation: Problematic (Potential overfitting)

Key Features

  • K-Fold Cross-Validation: Performs robust validation by splitting the training data into multiple folds and aggregating scores.
  • Dynamic Normality Testing: Uses Shapiro-Wilk test to check if train and validation scores follow a normal distribution.
  • Appropriate Hypothesis Testing: Automatically selects a paired t-test for normal data or a Wilcoxon test for non-normal data.
  • Meaningful Metrics: Reports mean and median differences between train and validation scores for deeper insights.
  • Flexible Inputs: Works with any scikit-learn model and performance metric, supporting both classification and regression tasks (a regression usage sketch follows this list).
  • Clear Interpretation: Provides a concise interpretation of results as “Healthy” or “Problematic” based on the statistical significance of differences.
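
To illustrate the regression support mentioned above, the usage sketch below reuses compare_train_test_scores with a RandomForestRegressor and r2_score on synthetic data; the dataset and estimator choices are placeholders.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data; only the training portion feeds the diagnostic
X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=42)
X_train, y_train = X[:400], y[:400]

reg_results = compare_train_test_scores(X_train, y_train, RandomForestRegressor(random_state=42), r2_score, k=10)
print(reg_results["test_used"], "->", reg_results["interpretation"])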