Split Sensitivity Analysis By Performance Divergence

Procedure

  1. Define Split Sizes and Divergence Metric

    • Identify train/test split sizes to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
    • Select a divergence metric, such as KL-divergence or JS-divergence, to measure the difference between score distributions.
    • Ensure input data is normalized before performing divergence calculations.
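
As a minimal sketch of step 1, the snippet below fixes candidate split ratios and computes a JS-divergence between two hypothetical score distributions. It assumes scores lie in [0, 1] (true for accuracy). Note that scipy.spatial.distance.jensenshannon returns the JS distance, so it is squared to recover the divergence.

import numpy as np
from scipy.spatial.distance import jensenshannon

split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

def scores_to_distribution(scores, bins=20):
    # Bin scores over [0, 1] and add a small constant so no bin is empty,
    # then normalize to a probability distribution
    hist = np.histogram(scores, bins=bins, range=(0, 1), density=True)[0] + 1e-10
    return hist / hist.sum()

# Hypothetical per-fold scores, used only to illustrate the metric
p = scores_to_distribution([0.80, 0.82, 0.79, 0.83, 0.81])
q = scores_to_distribution([0.74, 0.86, 0.78, 0.88, 0.70])
js_divergence = jensenshannon(p, q) ** 2  # square the JS distance to get the divergence
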
  2. Perform Model Evaluation with K-Fold Cross-Validation

    • For each split size, divide the training set using k-fold cross-validation (e.g., k=5).
    • Train a naive or standard model, such as logistic regression, on the training folds and evaluate it on the validation fold.
    • Record the model’s performance scores for each fold.
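
A minimal sketch of step 2, assuming a synthetic dataset, a logistic regression model, and accuracy scoring (all stand-ins; any scikit-learn estimator and scorer would do):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# Each entry of cv_scores_train is one fold's validation accuracy
cv_scores_train = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring="accuracy"
)
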
  3. Evaluate Test Set Performance

    • Train the model on the entire training set for each split size.
    • Test the model on the reserved test set and record the performance scores.
    • To obtain a test-side score distribution comparable to the k-fold results, the full code example below cross-validates within the reserved test set, as in the sketch that follows.
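
A sketch of both evaluation styles under the same assumptions as the step 2 sketch; the cross-validation within the test set mirrors how the full code example below obtains a test-side score distribution:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# Single held-out score: train on the full training set, score the test set once
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_score = accuracy_score(y_test, model.predict(X_test))

# Test-side score distribution: k-fold CV within the reserved test set
cv_scores_test = cross_val_score(
    LogisticRegression(max_iter=1000), X_test, y_test, cv=5, scoring="accuracy"
)
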
  4. Calculate Divergence Scores

    • Compute the divergence between the k-fold cross-validation performance score distribution and the test set performance score distribution for each split size.
    • Record divergence scores for each split size.
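
A self-contained sketch of the divergence computation, mirroring the calculate_divergence helper in the full code example; the per-fold scores are hypothetical:

import numpy as np
from scipy.stats import entropy

cv_scores_train = [0.80, 0.82, 0.79, 0.83, 0.81]  # hypothetical training-side CV scores
cv_scores_test = [0.74, 0.86, 0.78, 0.88, 0.70]   # hypothetical test-side CV scores

p = np.histogram(cv_scores_train, bins=20, range=(0, 1), density=True)[0] + 1e-10
q = np.histogram(cv_scores_test, bins=20, range=(0, 1), density=True)[0] + 1e-10

kl = entropy(p, q)  # KL(p || q); asymmetric, and scipy normalizes p and q internally
m = 0.5 * (p + q)
js = 0.5 * (entropy(p, m) + entropy(q, m))  # symmetric JS-divergence
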
  5. Count Cases of Low Divergence

    • Define a threshold for what constitutes “low divergence.”
    • Count the number of cases for each split size where the divergence score is below the threshold.
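
A minimal sketch of the counting step; the divergence scores and the 0.1 threshold are hypothetical placeholders:

threshold = 0.1  # what counts as "low" is a judgment call; tune it per metric and binning
divergences = [0.04, 0.12, 0.08, 0.25, 0.06]  # hypothetical scores for one split size
low_divergence_count = sum(d <= threshold for d in divergences)
print(low_divergence_count)  # 3
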
  6. Repeat and Aggregate Results

    • Repeat the entire process multiple times (e.g., 10–30 trials per split size; the code example below uses 30) to account for variability in the random splits.
    • Calculate the percentage of trials where each split size demonstrates low divergence.
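
A sketch of the aggregation, with simulated divergence scores standing in for the per-trial results of steps 2-4:

import numpy as np

rng = np.random.default_rng(0)
num_trials = 20
threshold = 0.1

# Simulated per-trial divergences; in practice each comes from one random train/test split
divergences = rng.uniform(0.0, 0.3, size=num_trials)
pct_low = 100.0 * np.mean(divergences <= threshold)
print(f"{pct_low:.1f}% of trials below threshold")
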
  7. Analyze and Summarize Findings

    • Compare the percentages of trials with low divergence across different split sizes.
    • Identify split size(s) that consistently yield low divergence scores, indicating stable and reliable performance evaluation.
  8. Report Findings

    • Summarize the comparison of divergence scores for each split size, including:
      • Average percentage of trials with low divergence.
      • Total number of cases with low divergence across trials.
    • Recommend the split size(s) that demonstrate the best balance of performance stability and distribution alignment.
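
As a sketch of steps 7 and 8, the snippet below ranks split ratios by their percentage of low-divergence trials, using the figures from the example output at the end of this section:

# Percentages of low-divergence trials per split, taken from the example output below
results = {
    "Train:0.5, Test:0.5": 33.33,
    "Train:0.7, Test:0.3": 26.67,
    "Train:0.8, Test:0.2": 13.33,
    "Train:0.9, Test:0.1": 3.33,
}
best_split = max(results, key=results.get)
print(f"Recommended split: {best_split} ({results[best_split]:.2f}% low-divergence trials)")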

Code Example

The Python code below evaluates train/test split ratios by counting, across repeated random trials, the cases where the divergence between the distributions of cross-validation (CV) scores on the training set and on the test set falls below a threshold.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import entropy

def calculate_divergence(cv_scores_train, cv_scores_test, divergence_type="kl"):
    """
    Calculate divergence between two score distributions.

    Parameters:
        cv_scores_train (array-like): Cross-validation scores on the training set.
        cv_scores_test (array-like): Cross-validation scores on the test set.
        divergence_type (str): Type of divergence to calculate ("kl" or "js").

    Returns:
        float: Divergence value.
    """
    # Bin scores into density histograms over [0, 1] (valid for metrics such as
    # accuracy); scipy's entropy normalizes its inputs, so they need not sum to 1
    train_dist = np.histogram(cv_scores_train, bins=20, range=(0, 1), density=True)[0]
    test_dist = np.histogram(cv_scores_test, bins=20, range=(0, 1), density=True)[0]

    # Add a small constant so empty bins cannot produce infinite divergence terms
    train_dist += 1e-10
    test_dist += 1e-10

    if divergence_type == "kl":
        return entropy(train_dist, test_dist)
    elif divergence_type == "js":
        avg_dist = 0.5 * (train_dist + test_dist)
        return 0.5 * (entropy(train_dist, avg_dist) + entropy(test_dist, avg_dist))
    else:
        raise ValueError("Invalid divergence_type. Choose 'kl' or 'js'.")

def evaluate_split_ratios_with_low_divergence(X, y, split_ratios, model, metric, k=10, num_trials=30, threshold=0.1, divergence_type="js"):
    """
    Evaluate train/test split ratios by counting cases of low divergence between CV scores.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of train/test split ratios to evaluate.
        model: A scikit-learn compatible model instance.
        metric: A scikit-learn compatible scoring function (e.g., accuracy_score).
        k (int): Number of folds for cross-validation.
        num_trials (int): Number of random trials per split ratio.
        threshold (float): Threshold for divergence to be considered "low."
        divergence_type (str): Type of divergence to calculate ("kl" or "js").

    Returns:
        dict: A dictionary with results for each split ratio, including the percentage of low-divergence trials.
    """
    scorer = make_scorer(metric)
    results = {}

    for train_ratio, test_ratio in split_ratios:
        split_key = f"Train:{train_ratio}, Test:{test_ratio}"
        low_divergence_count = 0

        for _ in range(num_trials):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=train_ratio, test_size=test_ratio, random_state=None  # fresh random split each trial
            )

            # k-fold CV score distributions on each side of the split
            cv_scores_train = cross_val_score(model, X_train, y_train, cv=k, scoring=scorer)
            cv_scores_test = cross_val_score(model, X_test, y_test, cv=k, scoring=scorer)

            # Calculate divergence
            divergence = calculate_divergence(cv_scores_train, cv_scores_test, divergence_type)

            # Count low divergence cases
            if divergence <= threshold:
                low_divergence_count += 1

        # Aggregate results
        results[split_key] = {
            "Low Divergence Cases (%)": (low_divergence_count / num_trials) * 100,
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Define train/test split ratios
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Define model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the diagnostic test
    results = evaluate_split_ratios_with_low_divergence(
        X, y, split_ratios, model, metric, k=5, num_trials=30, threshold=0.1, divergence_type="js"
    )

    # Print results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        for key, value in details.items():
            print(f"  - {key}: {value}")
        print("-" * 40)

Example Output

Split Ratio Train:0.5, Test:0.5:
  - Low Divergence Cases (%): 33.33
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
  - Low Divergence Cases (%): 26.67
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
  - Low Divergence Cases (%): 13.33
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
  - Low Divergence Cases (%): 3.33
----------------------------------------