Split Sensitivity Analysis By Performance Correlation

Procedure

  1. Define Split Sizes

    • Determine the train/test split sizes to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
  2. Choose a Metric and Model

    • Select a performance metric (e.g., accuracy, F1 score, or mean squared error) for evaluation.
    • Choose a standard, well-performing model (e.g., random forest) to ensure consistent and reliable baseline performance.
  3. Perform k-Fold Cross-Validation

    • For each split size, divide the data into train and test sets, then run k-fold cross-validation separately on the train set and on the test set to obtain a performance score for each fold.
    • Record the fold scores for both sets at every split size.
  4. Repeat Trials for Each Split Size

    • Repeat the split-and-cross-validate process multiple times (e.g., 10–30 trials) for each split size to account for random variability in how the data is partitioned.
    • Gather the resulting fold scores for every trial and split size.
  5. Assess Correlation of Performance Scores

    • For each trial, compute the correlation (e.g., Pearson’s or Spearman’s) between the train-set and test-set fold scores; a minimal sketch of this check appears after this list.
    • Record the percentage of trials whose scores are strongly correlated (e.g., correlation coefficient > 0.8).
  6. Aggregate and Analyze Results

    • Summarize the correlation results for each split size, including:
      • Percentage of trials with strong correlations.
      • Variability in correlations across split sizes.
    • Identify the split size(s) that consistently yield strong correlations and stable performance metrics.
  7. Report Findings

    • Prepare a report that includes:
      • Comparison of correlation percentages across split sizes.
      • Insights into which split sizes provide the most stable and reliable performance metrics.
    • Recommend the optimal split size based on evidence from the analysis.
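
The correlation check in step 5 can be illustrated in isolation. The sketch below uses two placeholder arrays of fold scores from a single trial (one from the train set, one from the test set) and reports both Pearson and Spearman coefficients against a strong-correlation threshold; the values are illustrative only.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder fold scores from one trial (train-set and test-set CV folds)
train_fold_scores = np.array([0.82, 0.78, 0.85, 0.80, 0.79])
test_fold_scores = np.array([0.75, 0.74, 0.81, 0.77, 0.73])

threshold = 0.8  # "strong correlation" cut-off from step 5

pearson_corr, _ = pearsonr(train_fold_scores, test_fold_scores)
spearman_corr, _ = spearmanr(train_fold_scores, test_fold_scores)

print(f"Pearson:  {pearson_corr:.3f} (strong: {pearson_corr > threshold})")
print(f"Spearman: {spearman_corr:.3f} (strong: {spearman_corr > threshold})")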

Code Example

Below is a Python implementation of the diagnostic. For each candidate train/test split ratio, it repeatedly splits the data, runs k-fold cross-validation on the train set and on the test set, and records how strongly the two sets of fold scores correlate across trials.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from scipy.stats import spearmanr

def split_sensitivity_analysis(X, y, split_ratios, model, metric, num_trials=30, correlation_threshold=0.75, k=10):
    """
    Perform train/test split sensitivity analysis by evaluating correlation of performance scores.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of (train_size, test_size) pairs (e.g., [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]).
        model (sklearn model): Machine learning model to evaluate.
        metric (str): Scoring metric for evaluation.
        num_trials (int): Number of random splits to test for each ratio.
        correlation_threshold (float): Threshold to consider correlations as "highly correlated".
        k (int): Number of folds in k-fold cross-validation.

    Returns:
        dict: Dictionary keyed by split-ratio label, where each value holds the percentage of
              trials with high correlations, the mean correlation, and the number of trials.
    """

    results = {}
    for split_ratio in split_ratios:
        split_key = f"Train:{split_ratio[0]}, Test:{split_ratio[1]}"
        high_correlation_count = 0
        correlations = []

        for _ in range(num_trials):
            # Split the data (stratify=y assumes a classification target;
            # remove it for regression problems)
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=split_ratio[0], test_size=split_ratio[1],
                random_state=None, stratify=y
            )

            # Perform k-fold cross-validation on the train set
            train_scores = cross_val_score(model, X_train, y_train, scoring=metric, cv=k)

            # Perform k-fold cross-validation on the test set (the test set
            # must contain enough samples per class to support k folds)
            test_scores = cross_val_score(model, X_test, y_test, scoring=metric, cv=k)

            # Compute correlation between train and test scores
            corr, _ = spearmanr(train_scores, test_scores)
            correlations.append(corr)
            if corr >= correlation_threshold:
                high_correlation_count += 1

        # Calculate percentage of trials with high correlations
        results[split_key] = {
            "high_correlation_percentage": (high_correlation_count / num_trials) * 100,
            "mean_correlation": np.mean(correlations),
            "num_trials": num_trials,
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=100, n_features=10, n_informative=8, random_state=42
    )

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]

    # Define model and metric
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    metric = 'accuracy'

    # Run the sensitivity analysis
    results = split_sensitivity_analysis(
        X, y, split_ratios, model, metric, correlation_threshold=0.5
    )

    # Print Results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        print(f"  - High Correlation Percentage: {details['high_correlation_percentage']:.2f}%")
        print(f"  - Mean Correlation: {details['mean_correlation']}")
        print("-" * 40)

Example Output

Split Ratio Train:0.5, Test:0.5:
  - High Correlation Percentage: 10.00%
  - Mean Correlation: -0.028202575708872864
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
  - High Correlation Percentage: 6.67%
  - Mean Correlation: 0.020054506714414368
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
  - High Correlation Percentage: 0.00%
  - Mean Correlation: -0.07473857510562325
----------------------------------------
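
To turn this summary into the recommendation called for in step 7, the split ratio with the highest share of strongly correlated trials can be read directly from the returned dictionary. The snippet below assumes the results object produced by the demo run above.

# Pick the split ratio with the largest percentage of strongly correlated trials
best_split = max(results, key=lambda s: results[s]["high_correlation_percentage"])
print(f"Recommended split: {best_split}")

Step 2 also allows regression metrics such as mean squared error. Because split_sensitivity_analysis hard-codes stratify=y, which only makes sense for classification targets, the sketch below adapts the same train/test correlation check to a regression task without stratification; Ridge, make_regression, and the 'neg_mean_squared_error' scoring string are illustrative choices rather than requirements.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from scipy.stats import spearmanr

X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=42)
model = Ridge()

correlations = []
for _ in range(10):  # a handful of trials for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, train_size=0.7, test_size=0.3)
    train_scores = cross_val_score(model, X_tr, y_tr, scoring="neg_mean_squared_error", cv=5)
    test_scores = cross_val_score(model, X_te, y_te, scoring="neg_mean_squared_error", cv=5)
    corr, _ = spearmanr(train_scores, test_scores)
    correlations.append(corr)

print(f"Mean Spearman correlation over 10 regression trials: {np.mean(correlations):.3f}")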