Compare Score Divergence

Procedure

This procedure evaluates whether model performance scores on the train and test sets follow the same distribution, which supports the generalizability of the model to new, unseen data.

  1. Evaluate Performance on Train Set

    • What to do: Apply the suite of machine learning algorithms to the train set using the chosen test harness.
      • Use a consistent evaluation method (e.g., 10-fold cross-validation) to assess performance.
      • Record the performance metric (e.g., accuracy, F1 score, RMSE) for each algorithm.
  2. Evaluate Performance on Test Set

    • What to do: Apply the same suite of algorithms to the test set using the same test harness.
      • Ensure the same performance metric is used for comparability.
      • Avoid data leakage by maintaining strict separation of train and test datasets.
  3. Gather Performance Metrics

    • What to do: Collect performance metrics from both train and test evaluations.
      • Organize these scores in a structured format (e.g., a table where rows represent algorithms and columns represent train and test performance).
  4. Assess Divergence Between Distributions

    • What to do: Compare the distributions of performance scores on train and test sets.
      • Use a divergence metric such as KL divergence or Jensen-Shannon (JS) divergence to quantify the difference between the two score distributions (a minimal sketch follows this procedure).
  5. Interpret Divergence Results

    • What to do: Examine the divergence metric and visualize the comparison (a plotting sketch appears after the example output below).
      • Low divergence indicates consistent performance between train and test sets, supporting generalizability.
      • High divergence suggests potential issues, such as overfitting, underfitting, or an ineffective train/test split.
  6. Diagnose Train/Test Split Issues

    • What to do: If divergence is high, investigate potential causes.
      • Check for data leakage or skewed data splits.
      • Verify that the training data adequately represents the test data in terms of feature distributions (see the per-feature check after the example output).
  7. Report Findings

    • What to do: Summarize the evaluation outcomes and implications for generalization.
      • Include divergence metrics, visualizations, and key observations.
      • Suggest next steps if high divergence is detected, such as improving the train/test split or enhancing model regularization.
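
The divergence check in step 4 can use either KL divergence or Jensen-Shannon (JS) divergence. The minimal sketch below computes both with scipy; the scores are illustrative placeholders, not results from the procedure, and note that scipy's jensenshannon returns the JS distance, i.e., the square root of the JS divergence.

import numpy as np
from scipy.stats import entropy                     # KL divergence for discrete distributions
from scipy.spatial.distance import jensenshannon    # JS distance (sqrt of JS divergence)

# Placeholder per-algorithm mean scores (one value per algorithm in the suite)
train_scores = np.array([0.82, 0.78, 0.75, 0.80, 0.85, 0.84, 0.50, 0.49])
test_scores = np.array([0.80, 0.74, 0.73, 0.79, 0.83, 0.82, 0.50, 0.51])

# Normalize so each vector sums to 1 and can be treated as a discrete distribution
p = train_scores / train_scores.sum()
q = test_scores / test_scores.sum()

kl = entropy(p, q)        # KL(p || q): asymmetric, unbounded
js = jensenshannon(p, q)  # JS distance: symmetric, bounded in [0, 1]

print(f"KL divergence: {kl:.4f}")
print(f"JS distance:   {js:.4f}")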

Code Example

This Python function evaluates the quality of a train/test split by measuring the divergence between the distributions of model performance scores across a suite of machine learning algorithms.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.spatial.distance import jensenshannon
from scipy.stats import shapiro

def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of a train/test split by analyzing the divergence
    between the distributions of model performance metrics across a suite of algorithms.

    Parameters:
        X_train, y_train: Features and target for the training set.
        X_test, y_test: Features and target for the testing set.
        metric (callable): Performance metric function (e.g., accuracy_score).
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Results containing divergence metrics and interpretation of the train/test split quality.
    """
    # Define a suite of machine learning algorithms
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified"),
    ]

    train_scores = []
    test_scores = []

    # Evaluate performance of each algorithm
    for model in algorithms:
        # Cross-validate on train set
        train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_scores))

        # Cross-validate on the test set as well, so both sets yield comparable per-algorithm scores
        test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_scores))

    # Normalize the mean scores so each vector sums to 1 and can be
    # treated as a discrete probability distribution
    train_scores_normalized = np.array(train_scores) / np.sum(train_scores)
    test_scores_normalized = np.array(test_scores) / np.sum(test_scores)

    # Jensen-Shannon distance (the square root of the JS divergence) between the two distributions
    divergence = jensenshannon(train_scores_normalized, test_scores_normalized)

    # Shapiro-Wilk test for normality of the per-algorithm mean scores (p-values returned for reporting)
    _, p_train_normal = shapiro(train_scores)
    _, p_test_normal = shapiro(test_scores)

    # Interpret results (the divergence thresholds below are heuristic)
    if divergence < 0.1:
        interpretation = "Good train/test split: minimal divergence in performance distributions."
    elif 0.1 <= divergence < 0.3:
        interpretation = "Moderate train/test split: some divergence in performance distributions."
    else:
        interpretation = "Poor train/test split: high divergence suggests generalization issues."

    return {
        "N": len(algorithms),
        "Divergence": divergence,
        "p_train_normal": p_train_normal,
        "p_test_normal": p_test_normal,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform the diagnostic test
    results = evaluate_train_test_split(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")

Example Output

Train/Test Split Diagnostic Results:
N: 8
Divergence: 0.012834262172191314
p_train_normal: 0.0010353786672046691
p_test_normal: 0.014716218131686956
Interpretation: Good train/test split: minimal divergence in performance distributions.
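
For steps 5 and 7, the comparison can also be visualized. The sketch below is a minimal matplotlib example using illustrative placeholder scores; in practice, evaluate_train_test_split could be modified to also return the raw train_scores and test_scores lists so they can be plotted directly.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder per-algorithm mean scores, one pair per algorithm in the suite
algorithm_names = ["LogReg", "Tree", "KNN", "SVC", "RF", "GB", "Dummy-MF", "Dummy-Strat"]
train_scores = [0.82, 0.78, 0.75, 0.80, 0.85, 0.84, 0.50, 0.49]
test_scores = [0.80, 0.74, 0.73, 0.79, 0.83, 0.82, 0.50, 0.51]

x = np.arange(len(algorithm_names))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(x - width / 2, train_scores, width, label="Train (CV mean)")
ax.bar(x + width / 2, test_scores, width, label="Test (CV mean)")
ax.set_xticks(x)
ax.set_xticklabels(algorithm_names, rotation=45, ha="right")
ax.set_ylabel("Accuracy")
ax.set_title("Per-algorithm performance: train vs. test")
ax.legend()
fig.tight_layout()
plt.show()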
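
For step 6, one way to check that the training data adequately represents the test data is to compare the distribution of each feature between the two sets, for example with a two-sample Kolmogorov-Smirnov test. The sketch below regenerates the same synthetic data as the demo so it is self-contained, and uses a conventional 0.05 threshold to flag features worth inspecting.

import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic data and split as the demo above
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Two-sample KS test per feature: a small p-value flags a feature whose
# train and test distributions differ noticeably
for i in range(X_train.shape[1]):
    stat, p_value = ks_2samp(X_train[:, i], X_test[:, i])
    flag = "check" if p_value < 0.05 else "ok"
    print(f"Feature {i}: KS statistic={stat:.3f}, p-value={p_value:.3f} ({flag})")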