Check Score Distributions
Assumptions
- Predefined Dataset Split: The training dataset is already split into train/test sets.
- Model and Metric: A single machine learning model and performance metric (e.g., accuracy, F1 score, RMSE) are defined for evaluation.
Procedure
- Perform 10-Fold Cross-Validation
  - What to do: Train the model on 9 folds and validate on the remaining fold. Repeat for all 10 folds.
  - Data Collection: Record the model's performance scores on both the train folds and the validation (test) folds for each iteration.
- Compile Performance Scores
  - What to do: Gather all performance scores from the train folds into one sample and all scores from the test folds into another sample.
- Compare Distributions of Train and Test Scores
  - What to do: Use a statistical test to compare the distributions of train and test scores (a minimal SciPy sketch follows after this list):
    - Apply the Kolmogorov-Smirnov test for a non-parametric comparison of the two distributions.
    - Alternatively, use the Anderson-Darling k-sample test for the same purpose.
    - Record the test statistic and p-value.
- Interpret Statistical Test Results
  - What to do: Based on the p-value:
    - If p ≤ 0.05, reject the null hypothesis (the distributions are significantly different).
    - If p > 0.05, fail to reject the null hypothesis (no significant difference between the distributions).
- Report Results
  - What to do:
    - Summarize the train and test score statistics (e.g., mean, standard deviation).
    - Include the test statistic, p-value, and a clear conclusion regarding the similarity of the distributions.
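A minimal sketch of the distribution-comparison step is shown below. It assumes the per-fold scores have already been collected into two NumPy arrays; the score values here are hypothetical placeholders, not real results.

import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

# Hypothetical per-fold scores from a 10-fold run (placeholders only)
train_scores = np.array([0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.93, 0.92])
test_scores = np.array([0.88, 0.90, 0.87, 0.89, 0.91, 0.88, 0.90, 0.89, 0.87, 0.90])

# Kolmogorov-Smirnov two-sample test (non-parametric)
ks_stat, ks_p = ks_2samp(train_scores, test_scores)

# Anderson-Darling k-sample test as an alternative
ad_result = anderson_ksamp([train_scores, test_scores])

print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"AD statistic={ad_result.statistic:.3f}, significance level={ad_result.significance_level:.3f}")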
Interpretation
Outcome
- Results Provided:
  - Statistical test result (p-value) indicating whether the train and test score distributions differ significantly.
  - Summary statistics for the train and test scores (e.g., mean, standard deviation).
  - Visual comparison (optional) using boxplots or histograms.
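The optional visual comparison can be produced with a few lines of matplotlib. This is a sketch that assumes train_scores and test_scores already hold the per-fold score arrays from the cross-validation run.

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Side-by-side boxplots of the two score samples
ax1.boxplot([train_scores, test_scores])
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Train folds", "Test folds"])
ax1.set_ylabel("Score")
ax1.set_title("Boxplot comparison")

# Overlaid histograms of the same samples
ax2.hist(train_scores, alpha=0.6, label="Train folds")
ax2.hist(test_scores, alpha=0.6, label="Test folds")
ax2.set_xlabel("Score")
ax2.legend()
ax2.set_title("Histogram comparison")

plt.tight_layout()
plt.show()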
Healthy/Problematic
- Healthy Distribution:
  - Train and test score distributions are similar (p > 0.05), indicating that the model's performance generalizes well across folds.
  - A minimal difference between train and test statistics suggests no overfitting.
- Problematic Distribution:
  - Train and test score distributions are significantly different (p ≤ 0.05), pointing to overfitting or issues in model generalization.
  - A large gap in mean or variance between train and test scores warrants further investigation.
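As a quick numeric check of that gap, the sketch below summarizes both score samples. It assumes train_scores and test_scores are the per-fold arrays collected during cross-validation; the helper name and example values are hypothetical.

import numpy as np

def summarize_gap(train_scores, test_scores):
    """Return mean/std of each score sample and the train-test mean gap."""
    train_scores = np.asarray(train_scores)
    test_scores = np.asarray(test_scores)
    summary = {
        "train_mean": train_scores.mean(),
        "train_std": train_scores.std(ddof=1),
        "test_mean": test_scores.mean(),
        "test_std": test_scores.std(ddof=1),
    }
    summary["mean_gap"] = summary["train_mean"] - summary["test_mean"]
    return summary

# Hypothetical fold scores: a large positive mean_gap suggests overfitting
print(summarize_gap([0.97, 0.96, 0.98, 0.97], [0.85, 0.83, 0.86, 0.84]))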
Limitations
- Data Size Sensitivity:
  - The comparison is based on only k scores per sample (e.g., 10 folds), so the statistical tests have limited power; small datasets also make individual fold scores noisier and less reliable.
- Metric-Specific:
  - Results depend on the performance metric chosen; alternative metrics might reveal different insights.
- Assumption of Independence:
  - Assumes folds are independent and representative; imbalanced or biased folds can skew results.
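One common mitigation for imbalanced folds in classification tasks is stratified splitting. The sketch below is an assumption, not part of the original procedure: it shows how the KFold call used in the code example below could be swapped for StratifiedKFold, and it assumes X and y hold the feature matrix and class labels.

from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the class proportions similar across splits,
# reducing the chance that a single skewed fold distorts the score distributions.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    # Fit the model and record train/test fold scores exactly as in the main code example
    pass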
Code Example
This function performs k-fold cross-validation on a dataset, calculates train and test fold performance scores, and evaluates their distributions using statistical tests to detect overfitting or generalization issues.
import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import ks_2samp, anderson_ksamp
def compare_score_distributions(train_X, train_y, model, metric, k=10):
    """
    Performs k-fold cross-validation, gathers train and test fold scores,
    and compares their distributions using statistical tests.

    Parameters:
    - train_X: NumPy array of training input features.
    - train_y: NumPy array of training target values.
    - model: An sklearn-compatible model (e.g., LogisticRegression, RandomForest).
    - metric: A performance metric function (e.g., accuracy_score, f1_score).
    - k: Number of folds for cross-validation (default is 10).

    Returns:
    - results: Dictionary containing the test statistic, p-value, test method, and interpretation.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_scores = []
    test_scores = []
    for train_index, test_index in kf.split(train_X):
        X_train_fold, X_test_fold = train_X[train_index], train_X[test_index]
        y_train_fold, y_test_fold = train_y[train_index], train_y[test_index]
        # Fit on the training folds, then score on both the train and validation folds
        model.fit(X_train_fold, y_train_fold)
        y_train_pred = model.predict(X_train_fold)
        y_test_pred = model.predict(X_test_fold)
        train_scores.append(metric(y_train_fold, y_train_pred))
        test_scores.append(metric(y_test_fold, y_test_pred))
    train_scores = np.array(train_scores)
    test_scores = np.array(test_scores)
    # Compare distributions with the Kolmogorov-Smirnov two-sample test (reported below)
    ks_stat, ks_p_value = ks_2samp(train_scores, test_scores)
    # The Anderson-Darling k-sample test is computed as a supplementary check;
    # inspect ad_result.statistic and ad_result.significance_level if desired.
    ad_result = anderson_ksamp([train_scores, test_scores])
    # Interpret the KS p-value
    if ks_p_value < 0.05:
        interpretation = "Different distributions (Potential overfitting)"
    else:
        interpretation = "Similar distributions (Generalization OK)"
    results = {
        "test_statistic": ks_stat,
        "p_value": ks_p_value,
        "test_method": "Kolmogorov-Smirnov Test",
        "interpretation": interpretation,
    }
    return results
# Demonstration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)
# Initialize model and metric
model = LogisticRegression(max_iter=1000)
metric = accuracy_score
# Perform the diagnostic test
results = compare_score_distributions(X, y, model, metric, k=10)
# Print the results
print("Results:")
for key, value in results.items():
    print(f"{key}: {value}")
Example Output
Results:
test_statistic: 0.5
p_value: 0.16782134274394334
test_method: Kolmogorov-Smirnov Test
interpretation: Similar distributions (Generalization OK)
Key Features
- K-Fold Cross-Validation: Collects performance scores from train and test folds for robust evaluation.
- Distribution Comparison: Uses the two-sample Kolmogorov-Smirnov test to compare score distributions, with the Anderson-Darling k-sample test computed as a supplementary check.
- Interpretation: Provides a concise interpretation of whether the model is overfitting or generalizing well.
- Versatile Input: Works with classification or regression models and customizable performance metrics (a regression usage sketch follows below).
- Simplified Output: Returns a dictionary for easy integration into larger workflows or reports.
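To illustrate the versatility point, the sketch below reuses compare_score_distributions for a regression problem. The dataset, model, and metric choices here are hypothetical examples, not part of the original demonstration.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (hypothetical example)
X_reg, y_reg = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)

# Any sklearn-compatible estimator and metric function can be plugged in
reg_model = RandomForestRegressor(n_estimators=100, random_state=42)
reg_results = compare_score_distributions(X_reg, y_reg, reg_model, mean_squared_error, k=10)

print("Regression results:")
for key, value in reg_results.items():
    print(f"{key}: {value}")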