Split Sensitivity Analysis By Performance Distribution

Procedure

  1. Define Split Ratios and Metric

    • Identify train/test split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
    • Select a performance metric (e.g., accuracy, precision, recall, F1 score) for evaluation.
  2. Perform Model Evaluation with K-Fold Cross-Validation

    • For each split ratio, divide the train set using k-fold cross-validation (e.g., k=5).
    • Train a standard model, such as logistic regression, on the training folds and evaluate on the validation fold.
    • Record performance metrics for each fold.
  3. Evaluate Test Set Performance

    • Train the model on the entire train set for each split ratio.
    • Test the model on the reserved test set and record the performance metric.
  4. Compare Performance Distributions

    • Use a statistical test (e.g., the Kolmogorov-Smirnov test or Anderson-Darling test) to compare the distribution of k-fold cross-validation scores with the test set score for each split ratio (see the sketch after this list).
  5. Analyze Consistency and Stability

    • Count the number of cases where the test finds no significant difference between the train (k-fold) and test set score distributions (e.g., p-value > 0.05).
    • Repeat the entire process multiple times (e.g., 10–20 trials per split ratio) to account for variability.
  6. Aggregate and Summarize Results

    • Calculate the percentage of trials where the train and test set score distributions are consistent (no significant difference) for each split ratio.
    • Identify the split ratio(s) that consistently show close agreement between train and test performance.
  7. Report Findings

    • Summarize the performance comparison results for each split ratio, including:
      • Average percentage of trials with consistent (statistically indistinguishable) score distributions.
      • Number of trials where train and test performance agree closely.
    • Highlight the split ratio(s) that provide stable and reliable performance evaluation.
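
The distribution comparison in step 4 can be illustrated on its own. The sketch below is a minimal example, assuming you already have two arrays of scores to compare (for instance, the k-fold scores from step 2 and test scores collected over repeated splits from step 3); the score values shown are made up for illustration, and both tests come from scipy.stats.

import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

# Illustrative score arrays; in practice these come from steps 2 and 3.
cv_scores = np.array([0.84, 0.86, 0.85, 0.83, 0.87])    # k-fold scores on the train set
test_scores = np.array([0.85, 0.82, 0.86, 0.84, 0.85])  # test scores over repeated splits

# Kolmogorov-Smirnov test: compares the two empirical distributions.
ks_stat, ks_p = ks_2samp(cv_scores, test_scores)
print(f"KS statistic: {ks_stat:.3f}, p-value: {ks_p:.3f}")

# Anderson-Darling k-sample test: puts more weight on the tails.
ad_result = anderson_ksamp([cv_scores, test_scores])
print(f"AD statistic: {ad_result.statistic:.3f}, "
      f"approximate significance level: {ad_result.significance_level:.3f}")

# A p-value above the chosen alpha means the test finds no evidence
# that the two score distributions differ.
alpha = 0.05
print("Consistent" if ks_p > alpha else "Different")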

Code Example

Below is a Python function that performs a diagnostic test to evaluate train/test split ratios by comparing model performance distributions using k-fold cross-validation and a chosen evaluation metric.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import ks_2samp

def model_split_sensitivity_analysis(X, y, split_ratios, model, metric, k=10, num_trials=30, alpha=0.05):
    """
    Perform sensitivity analysis of train/test split sizes by comparing model performance distributions.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        model: A scikit-learn compatible model instance.
        metric: A scikit-learn compatible scoring function (e.g., accuracy_score).
        k (int): Number of folds for cross-validation.
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for the Kolmogorov-Smirnov test.

    Returns:
        dict: A dictionary with split ratios as keys and summary results of consistency for each trial.
    """
    scorer = make_scorer(metric)
    results = {}

    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = []

        for _ in range(num_trials):
            X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_ratio, test_size=test_ratio)

            # Cross-validation performance
            cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=scorer)

            # Test set performance
            model.fit(X_train, y_train)
            test_score = metric(y_test, model.predict(X_test))

            # Distribution comparison: the KS test compares the k-fold score
            # distribution against the single test score (a one-element sample)
            stat, p_value = ks_2samp(cv_scores, [test_score])
            results[ratio_key].append(p_value > alpha)  # Consistency if p-value > alpha

        # Summarize results for the split ratio
        results[ratio_key] = {
            "Consistent Trials (%)": np.mean(results[ratio_key]) * 100
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Initialize model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the sensitivity analysis
    results = model_split_sensitivity_analysis(X, y, split_ratios, model, metric, k=10, num_trials=30, alpha=0.01)

    # Print Results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        for key, value in details.items():
            print(f"  - {key}: {value:.2f}%")
        print("-" * 40)

Example Output

Split Ratio Train:0.5, Test:0.5:
  - Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
  - Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
  - Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
  - Consistent Trials (%): 100.00%
----------------------------------------
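
The same function can be reused with a different model or metric from step 1. The snippet below is a sketch of that kind of variation, assuming X, y, and split_ratios from the demo above are still in scope; RandomForestClassifier and f1_score are illustrative choices, not part of the procedure itself.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Re-run the analysis with a different model and metric.
f1_results = model_split_sensitivity_analysis(
    X, y, split_ratios,
    model=RandomForestClassifier(n_estimators=100, random_state=42),
    metric=f1_score,
    k=5, num_trials=10, alpha=0.05,
)

for split, details in f1_results.items():
    print(f"Split Ratio {split}: {details['Consistent Trials (%)']:.2f}% consistent trials")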