Compare Score Distributions

Procedure

This procedure assesses whether model performance scores on the train and test sets follow the same distribution, which suggests the split is representative and that results should generalize to new data. It helps verify the validity of the train/test split and surface potential issues such as a non-representative split, data leakage, or overfitting.

  1. Evaluate Performance on Train Set

    • What to do: Apply each algorithm in the suite to the train set using the chosen test harness.
      • Use consistent evaluation methods, such as 10-fold cross-validation.
      • Record performance metrics (e.g., accuracy, RMSE, F1-score) for each algorithm.
  2. Evaluate Performance on Test Set

    • What to do: Apply the same algorithms to the test set using the same test harness and evaluation metric.
      • Ensure strict separation of train and test data to avoid data leakage.
      • Record the performance scores for comparison with the train set.
  3. Gather Performance Scores

    • What to do: Collect performance metrics from both train and test evaluations.
      • Organize the results in a structured table with rows for each algorithm and columns for train and test scores.
  4. Test Distribution Similarity

    • What to do: Use a statistical test to compare the distributions of performance scores from the train and test sets (see the sketch after this list).
      • The two-sample Kolmogorov-Smirnov test and the Anderson-Darling k-sample test are non-parametric and make no normality assumption, so they are sensible defaults.
      • If the scores are approximately normal, a paired t-test on the matched train/test scores is a reasonable parametric alternative.
      • Null Hypothesis: Train and test scores come from the same distribution.
  5. Interpret the Results

    • What to do: Analyze the statistical test results.
      • If the p-value is high (e.g., > 0.05), fail to reject the null hypothesis, suggesting similar distributions.
      • If the p-value is low (e.g., ≤ 0.05), reject the null hypothesis, indicating possible issues with the train/test split.
  6. Investigate Discrepancies

    • What to do: If distributions differ, explore potential causes.
      • Check for data leakage, inadequate train set size, or overfitting to specific features (a minimal leakage-check sketch appears at the end of this section).
      • Reassess preprocessing steps and splitting methods.
  7. Report the Findings

    • What to do: Summarize the test outcomes.
      • Include statistical test results, p-values, and performance metrics for both sets.
      • Provide interpretations and actionable recommendations based on the findings, such as refining the split strategy or improving data preprocessing.
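
The sketch below illustrates steps 3 and 4: gathering the per-algorithm scores into a small table and testing whether the two sets of scores plausibly come from the same distribution. It is a minimal sketch; the train_scores and test_scores values are hypothetical placeholders for the mean cross-validation scores produced per algorithm (as in the full example below), and pandas is used only to lay the results out as a table.

import pandas as pd
from scipy.stats import ks_2samp, mannwhitneyu

# Hypothetical per-algorithm mean CV scores (placeholders for step 3's table).
algorithms = ["LR", "CART", "KNN", "SVM", "RF", "GBM", "Dummy-MF", "Dummy-S"]
train_scores = [0.85, 0.91, 0.88, 0.90, 0.93, 0.92, 0.34, 0.50]
test_scores = [0.82, 0.86, 0.85, 0.88, 0.90, 0.89, 0.34, 0.48]

# Step 3: organize results with one row per algorithm and columns for train/test scores.
scores = pd.DataFrame({"algorithm": algorithms, "train": train_scores, "test": test_scores})
print(scores)

# Step 4: compare the two score distributions with non-parametric tests.
ks_stat, ks_p = ks_2samp(scores["train"], scores["test"])      # two-sample Kolmogorov-Smirnov
mw_stat, mw_p = mannwhitneyu(scores["train"], scores["test"])  # Mann-Whitney U as a second check
print(f"KS: statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"Mann-Whitney U: statistic={mw_stat:.1f}, p-value={mw_p:.3f}")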

Code Example

This Python function implements a diagnostic test to evaluate the quality of the train/test split by comparing the distributions of model performance metrics across a suite of machine learning algorithms.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import ks_2samp

def evaluate_split_distribution(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing performance metric distributions.

    Parameters:
        X_train, y_train: Training features and labels.
        X_test, y_test: Testing features and labels.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Results summarizing distribution comparison and split quality interpretation.
    """
    # Define a suite of machine learning algorithms
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified")
    ]

    train_scores = []
    test_scores = []

    # Evaluate each algorithm on train and test sets
    for model in algorithms:
        # Cross-validate on train set
        train_cv_score = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_score))

        # Cross-validate on test set
        test_cv_score = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_score))

    # Perform Kolmogorov-Smirnov test to compare distributions
    ks_stat, p_value = ks_2samp(train_scores, test_scores)

    # Interpret results
    alpha = 0.05
    if p_value > alpha:
        interpretation = "Train and test distributions are statistically similar. The split is likely adequate."
    else:
        interpretation = "Train and test distributions differ significantly. Consider revisiting the split or preprocessing."

    return {
        "KS Statistic": ks_stat,
        "p-value": p_value,
        "Significance Level": alpha,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform the diagnostic test
    results = evaluate_split_distribution(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")

Example Output

Train/Test Split Diagnostic Results:
KS Statistic: 0.375
p-value: 0.6601398601398599
Significance Level: 0.05
Interpretation: Train and test distributions are statistically similar. The split is likely adequate.
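
If the diagnostic flags a mismatch, the first discrepancy check from step 6 can be automated: look for rows that appear verbatim in both the train and test sets, a common and easy-to-miss form of leakage. This is a minimal sketch, assuming X_train and X_test are NumPy arrays as in the demo above; identical rows are only one possible leakage path.

import numpy as np

def count_shared_rows(X_train, X_test):
    """Count rows of X_test that also appear verbatim in X_train."""
    # Hash each row by its raw bytes for a quick exact-match lookup.
    train_rows = {row.tobytes() for row in np.ascontiguousarray(X_train)}
    return sum(row.tobytes() in train_rows for row in np.ascontiguousarray(X_test))

# Any count above zero suggests the split leaks training rows into the test set.
print("Rows shared between train and test:", count_shared_rows(X_train, X_test))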