Check Score Distributions
Assumptions
- Predefined Dataset Split: The training dataset is already split into train/test sets.
- Model and Metric: A single machine learning model and performance metric (e.g., accuracy, F1 score, RMSE) are defined for evaluation.
Procedure
- Perform 10-Fold Cross-Validation
  - What to do: Train the model on 9 folds and validate on the remaining fold. Repeat for all 10 folds.
  - Data Collection: Record the model's performance scores on both the train folds and the validation (test) folds for each iteration.
- Compile Performance Scores
  - What to do: Gather all performance scores from the train folds into one sample and all scores from the test folds into another sample.
- Compare Distributions of Train and Test Scores
  - What to do: Use a statistical test to compare the distributions of train and test scores (a minimal SciPy sketch follows after this list):
    - Apply the Kolmogorov-Smirnov test for a non-parametric comparison of the two distributions.
    - Alternatively, use the Anderson-Darling k-sample test for the same purpose.
    - Record the test statistic and p-value.
- Interpret Statistical Test Results
  - What to do: Based on the p-value:
    - If p ≤ 0.05, reject the null hypothesis (the distributions are significantly different).
    - If p > 0.05, fail to reject the null hypothesis (no significant difference between the distributions).
- Report Results
  - What to do:
    - Summarize the train and test score statistics (e.g., mean, standard deviation).
    - Include the test statistic, p-value, and a clear conclusion regarding the similarity of the distributions.
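A minimal sketch of the distribution-comparison step is shown below. It assumes the per-fold scores have already been collected into two NumPy arrays; the score values here are hypothetical placeholders, not real results.

import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

# Hypothetical per-fold scores from a 10-fold run (placeholders only)
train_scores = np.array([0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.93, 0.92])
test_scores = np.array([0.88, 0.90, 0.87, 0.89, 0.91, 0.88, 0.90, 0.89, 0.87, 0.90])

# Kolmogorov-Smirnov two-sample test (non-parametric)
ks_stat, ks_p = ks_2samp(train_scores, test_scores)

# Anderson-Darling k-sample test as an alternative
ad_result = anderson_ksamp([train_scores, test_scores])

print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"AD statistic={ad_result.statistic:.3f}, significance level={ad_result.significance_level:.3f}")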
Interpretation
Outcome
- Results Provided:
  - Statistical test result (p-value) indicating whether the train and test score distributions differ significantly.
  - Summary statistics for the train and test scores (e.g., mean, standard deviation).
  - Visual comparison (optional) using boxplots or histograms.
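The optional visual comparison can be produced with a few lines of matplotlib. This is a sketch that assumes train_scores and test_scores already hold the per-fold score arrays from the cross-validation run.

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Side-by-side boxplots of the two score samples
ax1.boxplot([train_scores, test_scores])
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Train folds", "Test folds"])
ax1.set_ylabel("Score")
ax1.set_title("Boxplot comparison")

# Overlaid histograms of the same samples
ax2.hist(train_scores, alpha=0.6, label="Train folds")
ax2.hist(test_scores, alpha=0.6, label="Test folds")
ax2.set_xlabel("Score")
ax2.legend()
ax2.set_title("Histogram comparison")

plt.tight_layout()
plt.show()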
Healthy/Problematic
- Healthy Distribution:
  - Train and test score distributions are similar (p > 0.05), indicating that the model's performance generalizes well across folds.
  - A minimal difference between train and test statistics suggests no overfitting.
- Problematic Distribution:
  - Train and test score distributions are significantly different (p ≤ 0.05), pointing to overfitting or issues in model generalization.
  - A large gap in mean or variance between train and test scores warrants further investigation.
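As a quick numeric check of that gap, the sketch below summarizes both score samples. It assumes train_scores and test_scores are the per-fold arrays collected during cross-validation; the helper name and example values are hypothetical.

import numpy as np

def summarize_gap(train_scores, test_scores):
    """Return mean/std of each score sample and the train-test mean gap."""
    train_scores = np.asarray(train_scores)
    test_scores = np.asarray(test_scores)
    summary = {
        "train_mean": train_scores.mean(),
        "train_std": train_scores.std(ddof=1),
        "test_mean": test_scores.mean(),
        "test_std": test_scores.std(ddof=1),
    }
    summary["mean_gap"] = summary["train_mean"] - summary["test_mean"]
    return summary

# Hypothetical fold scores: a large positive mean_gap suggests overfitting
print(summarize_gap([0.97, 0.96, 0.98, 0.97], [0.85, 0.83, 0.86, 0.84]))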
Limitations
- Data Size Sensitivity:
  - The comparison is based on only k scores per sample (e.g., 10 folds), so the statistical tests have limited power; small datasets also make individual fold scores noisier and less reliable.
- Metric-Specific:
  - Results depend on the performance metric chosen; alternative metrics might reveal different insights.
- Assumption of Independence:
  - Assumes folds are independent and representative; imbalanced or biased folds can skew results.
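One common mitigation for imbalanced folds in classification tasks is stratified splitting. The sketch below is an assumption, not part of the original procedure: it shows how the KFold call used in the code example below could be swapped for StratifiedKFold, and it assumes X and y hold the feature matrix and class labels.

from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the class proportions similar across splits,
# reducing the chance that a single skewed fold distorts the score distributions.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    # Fit the model and record train/test fold scores exactly as in the main code example
    pass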
Code Example
This function performs k-fold cross-validation on a dataset, calculates train and test fold performance scores, and evaluates their distributions using statistical tests to detect overfitting or generalization issues.
import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import ks_2samp, anderson_ksamp
def compare_score_distributions(train_X, train_y, model, metric, k=10):
    """
    Performs k-fold cross-validation, gathers train and test fold scores,
    and compares their distributions using statistical tests.

    Parameters:
    - train_X: NumPy array of training input features.
    - train_y: NumPy array of training target values.
    - model: An sklearn-compatible model (e.g., LogisticRegression, RandomForest).
    - metric: A performance metric function (e.g., accuracy_score, f1_score).
    - k: Number of folds for cross-validation (default is 10).

    Returns:
    - results: Dictionary containing the test statistic, p-value, test method, and interpretation.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_scores = []
    test_scores = []
    for train_index, test_index in kf.split(train_X):
        X_train_fold, X_test_fold = train_X[train_index], train_X[test_index]
        y_train_fold, y_test_fold = train_y[train_index], train_y[test_index]
        # Fit on the training folds, then score on both the train and validation folds
        model.fit(X_train_fold, y_train_fold)
        y_train_pred = model.predict(X_train_fold)
        y_test_pred = model.predict(X_test_fold)
        train_scores.append(metric(y_train_fold, y_train_pred))
        test_scores.append(metric(y_test_fold, y_test_pred))
    train_scores = np.array(train_scores)
    test_scores = np.array(test_scores)
    # Compare distributions with the Kolmogorov-Smirnov two-sample test (reported below)
    ks_stat, ks_p_value = ks_2samp(train_scores, test_scores)
    # The Anderson-Darling k-sample test is computed as a supplementary check;
    # inspect ad_result.statistic and ad_result.significance_level if desired.
    ad_result = anderson_ksamp([train_scores, test_scores])
    # Interpret the KS p-value
    if ks_p_value < 0.05:
        interpretation = "Different distributions (Potential overfitting)"
    else:
        interpretation = "Similar distributions (Generalization OK)"
    results = {
        "test_statistic": ks_stat,
        "p_value": ks_p_value,
        "test_method": "Kolmogorov-Smirnov Test",
        "interpretation": interpretation,
    }
    return results
# Demonstration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)
# Initialize model and metric
model = LogisticRegression(max_iter=1000)
metric = accuracy_score
# Perform the diagnostic test
results = compare_score_distributions(X, y, model, metric, k=10)
# Print the results
print("Results:")
for key, value in results.items():
    print(f"{key}: {value}")
Example Output
Results:
test_statistic: 0.5
p_value: 0.16782134274394334
test_method: Kolmogorov-Smirnov Test
interpretation: Similar distributions (Generalization OK)
Key Features
- K-Fold Cross-Validation: Collects performance scores from train and test folds for robust evaluation.
- Distribution Comparison: Uses the two-sample Kolmogorov-Smirnov test to compare score distributions, with the Anderson-Darling k-sample test computed as a supplementary check.
- Interpretation: Provides a concise interpretation of whether the model is overfitting or generalizing well.
- Versatile Input: Works with classification or regression models and customizable performance metrics (a regression usage sketch follows below).
- Simplified Output: Returns a dictionary for easy integration into larger workflows or reports.
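To illustrate the versatility point, the sketch below reuses compare_score_distributions for a regression problem. The dataset, model, and metric choices here are hypothetical examples, not part of the original demonstration.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (hypothetical example)
X_reg, y_reg = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)

# Any sklearn-compatible estimator and metric function can be plugged in
reg_model = RandomForestRegressor(n_estimators=100, random_state=42)
reg_results = compare_score_distributions(X_reg, y_reg, reg_model, mean_squared_error, k=10)

print("Regression results:")
for key, value in reg_results.items():
    print(f"{key}: {value}")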