Compare Score Central Tendencies
Assumptions
- Predefined Dataset Split: The dataset has already been split into training and test sets.
- Model and Metric: A single machine learning model and performance metric (e.g., accuracy, F1 score, RMSE) are defined for evaluation.
Procedure
1. Perform 10-Fold Cross-Validation
   - What to do: Split the training dataset into 10 folds. For each fold:
     - Train the model on the remaining 9 folds.
     - Evaluate performance on the held-out fold.
   - Data Collection: Gather the performance scores for both the training folds and the held-out folds across all iterations.
2. Check Normality of Score Distributions
   - What to do: Use statistical tests and visualizations to assess whether the train and test fold score distributions are normal:
     - Apply the Shapiro-Wilk test.
     - Generate histograms or Q-Q plots for visual confirmation of normality (see the sketch after this procedure).
3. Compare Central Tendencies
   - What to do: Select the hypothesis test based on the normality of the score distributions:
     - If normal, use a paired t-test to compare mean scores.
     - If non-normal, use the Wilcoxon signed-rank test to compare median scores.
4. Calculate the Absolute Difference in Central Tendencies
   - What to do: Measure the absolute difference between the central tendencies (mean or median) of the train and test fold scores.
   - Record this difference to assess the gap between training and testing performance.
5. Interpret and Report Results
   - What to do: Present the findings in a clear table:
     - Include fold numbers, train scores, test scores, p-values, and the absolute difference in central tendencies.
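The visual normality check in step 2 and the per-fold report in step 5 are not covered by the full code example further below, so here is a minimal sketch of both. The train_scores and val_scores arrays stand in for the per-fold scores collected in step 1; the values below are synthetic placeholders and the variable names are illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative per-fold scores; in practice these come from step 1 of the procedure.
rng = np.random.default_rng(0)
train_scores = rng.normal(loc=0.97, scale=0.01, size=10)
val_scores = rng.normal(loc=0.90, scale=0.02, size=10)

# Step 2: Shapiro-Wilk test plus visual checks (histogram and Q-Q plot).
for name, scores in [("train", train_scores), ("validation", val_scores)]:
    stat, p = stats.shapiro(scores)
    print(f"{name}: Shapiro-Wilk W={stat:.3f}, p={p:.3f}")

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for row, (name, scores) in enumerate([("train", train_scores), ("validation", val_scores)]):
    axes[row, 0].hist(scores, bins=5)
    axes[row, 0].set_title(f"{name} score histogram")
    stats.probplot(scores, dist="norm", plot=axes[row, 1])
    axes[row, 1].set_title(f"{name} Q-Q plot")
fig.tight_layout()
plt.show()

# Step 5: per-fold report table plus the absolute difference in central tendencies.
report = pd.DataFrame({
    "fold": np.arange(1, len(train_scores) + 1),
    "train_score": train_scores,
    "val_score": val_scores,
})
print(report)
print("Absolute difference in means:", abs(train_scores.mean() - val_scores.mean()))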
Interpretation
Outcome
- Results Provided:
- Central tendencies (mean or median) of train and test fold scores.
- p-value from the hypothesis test indicating whether the difference is statistically significant.
- Absolute difference in central tendencies.
Healthy/Problematic
- Healthy Central Tendencies:
- Small absolute differences in central tendencies with non-significant p-values (e.g., p > 0.05) indicate consistent performance between train and test folds.
- Suggests the model generalizes well without significant overfitting.
- Problematic Central Tendencies:
- Large absolute differences or significant p-values (e.g., p ≤ 0.05) suggest potential overfitting or an inability to generalize.
- Train fold performance significantly exceeding test fold performance indicates overfitting.
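As a small illustration of how the two criteria above can be combined, the sketch below classifies a result from both the absolute difference and the p-value. The 0.05 significance level and the 0.05 gap threshold are assumptions chosen for illustration, not fixed rules.
def interpret_gap(mean_diff, p_value, alpha=0.05, max_gap=0.05):
    """Classify the train/validation gap; alpha and max_gap are illustrative thresholds."""
    if p_value > alpha and mean_diff <= max_gap:
        return "Healthy: train and test fold scores are consistent."
    return "Problematic: significant or large gap; investigate overfitting."

print(interpret_gap(mean_diff=0.01, p_value=0.42))   # small, non-significant gap
print(interpret_gap(mean_diff=0.08, p_value=0.003))  # large, significant gap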
Limitations
- Model and Metric Dependence: Results are specific to the chosen model and performance metric; testing with multiple metrics may provide a fuller picture.
- Dataset Size and Balance: Small or imbalanced datasets may result in unreliable results or biased folds.
- Assumption of Fold Independence: Assumes that folds represent independent samples, which may not hold in cases of temporal or spatial correlation.
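The class-balance and fold-independence caveats can be partly mitigated by swapping the splitter used in step 1. The sketch below shows two scikit-learn alternatives to KFold; treating them as drop-in replacements inside the cross-validation loop is an assumption about your data, not part of the procedure above.
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Imbalanced classification data: preserve class proportions in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# for train_index, val_index in skf.split(train_X, train_y): ...

# Temporally ordered data: validation folds always come after the training folds.
tss = TimeSeriesSplit(n_splits=10)
# for train_index, val_index in tss.split(train_X): ...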
Code Example
This function performs k-fold cross-validation on the training data, collects the per-fold training and validation scores, applies a statistical test to compare their central tendencies, and interprets the result. The held-out test set is accepted as an argument for completeness but is not used by the cross-validation diagnostic itself.
import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import shapiro, ttest_rel, wilcoxon
def compare_train_test_scores(train_X, train_y, test_X, test_y, model, metric, k=10):
    """
    Performs k-fold cross-validation on the training data, gathers train and validation scores,
    compares their central tendencies, and interprets the results.

    Parameters:
    - train_X: Numpy array of training input features.
    - train_y: Numpy array of training target values.
    - test_X: Numpy array of test input features (not used by this diagnostic; kept for API symmetry).
    - test_y: Numpy array of test target values (not used by this diagnostic; kept for API symmetry).
    - model: An sklearn estimator (e.g., LogisticRegression(), RandomForestRegressor()).
    - metric: Performance metric function (e.g., accuracy_score, r2_score).
    - k: Number of folds for cross-validation (default is 10).

    Returns:
    - results: Dictionary containing mean/median differences, p-value, test used, and an interpretation.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_scores = []
    val_scores = []
    for train_index, val_index in kf.split(train_X):
        X_train_fold, X_val_fold = train_X[train_index], train_X[val_index]
        y_train_fold, y_val_fold = train_y[train_index], train_y[val_index]
        model.fit(X_train_fold, y_train_fold)
        train_pred = model.predict(X_train_fold)
        val_pred = model.predict(X_val_fold)
        train_scores.append(metric(y_train_fold, train_pred))
        val_scores.append(metric(y_val_fold, val_pred))
    train_scores = np.array(train_scores)
    val_scores = np.array(val_scores)
    # Test for normality of the per-fold score distributions
    _, p_train = shapiro(train_scores)
    _, p_val = shapiro(val_scores)
    alpha = 0.05
    # Choose the appropriate paired test
    if p_train > alpha and p_val > alpha:
        # Both distributions look normal: use a paired t-test on the means
        test_stat, p_value = ttest_rel(train_scores, val_scores)
        test_used = "Paired t-test"
    else:
        # At least one distribution is non-normal: use the Wilcoxon signed-rank test
        test_stat, p_value = wilcoxon(train_scores, val_scores)
        test_used = "Wilcoxon test"
    # Calculate absolute mean and median differences
    mean_difference = np.abs(np.mean(train_scores) - np.mean(val_scores))
    median_difference = np.abs(np.median(train_scores) - np.median(val_scores))
    # Interpret results
    interpretation = "Healthy (Generalization OK)" if p_value > alpha else "Problematic (Potential overfitting)"
    results = {
        "mean_difference": mean_difference,
        "median_difference": median_difference,
        "p_value": p_value,
        "test_used": test_used,
        "interpretation": interpretation,
    }
    return results
# Demonstration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]
# Initialize model and metric
model = LogisticRegression(max_iter=1000)
metric = accuracy_score
# Run the diagnostic test
results = compare_train_test_scores(X_train, y_train, X_test, y_test, model, metric, k=10)
# Print the results
print("Results of Train vs Validation Score Comparison:")
for key, value in results.items():
    print(f"{key.capitalize()}: {value:.4f}" if isinstance(value, float) else f"{key.capitalize()}: {value}")
Example Output
Results of Train vs Validation Score Comparison:
Mean_difference: 0.0543
Median_difference: 0.0478
P_value: 0.0135
Test_used: Wilcoxon test
Interpretation: Problematic (Potential overfitting)
Key Features
- K-Fold Cross-Validation: Performs robust validation by splitting the training data into multiple folds and aggregating scores.
- Dynamic Normality Testing: Uses the Shapiro-Wilk test to check whether the train and validation scores follow a normal distribution.
- Appropriate Hypothesis Testing: Automatically selects a paired t-test for normal data or a Wilcoxon test for non-normal data.
- Meaningful Metrics: Reports mean and median differences between train and validation scores for deeper insights.
- Flexible Inputs: Works with any scikit-learn model and performance metric, supporting both classification and regression tasks.
- Clear Interpretation: Provides a concise interpretation of results as “Healthy” or “Problematic” based on the statistical significance of differences.