Split Sensitivity Analysis By Data Variance

Procedure

  1. Define Split Ratios

    • Select split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
    • Determine the number of trials to perform for each split ratio (e.g., 10–20 iterations per ratio).
  2. Generate Random Splits

    • For each split ratio, create random train/test splits multiple times to simulate variability and assess stability.
  3. Perform Variance Tests

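    • For each split and trial, apply a variance comparison test such as the F-test or Levene’s test (the F-test assumes normally distributed data; Levene’s test is more robust to departures from normality):
      • Test each input/target variable between train and test datasets.
      • Record whether the test fails to reject equal variances (e.g., p-value > 0.05) and treat that outcome as "similar" variances.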
  4. Calculate Consistency Metrics

    • For each split ratio and across all trials, calculate:
      • Percentage of variables with similar variances in train and test datasets.
      • Number of trials in which all variables have similar variances.
  5. Aggregate Results Across Splits

    • Summarize results for each split ratio by:
      • Averaging the percentage of variables with similar variances across trials.
      • Counting the number of trials where all variables match variances.
  6. Report Findings

    • Present findings in a clear tabular or graphical format:
      • Highlight split ratios with consistently high variance similarity across variables.
      • Identify any split ratios where none or only a few variables maintain similar variances.
    • Provide recommendations for optimal split ratios based on observed trends and stability.
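
Steps 3 to 5 can be sketched as a small helper that reports the two trial-level metrics named above for a single split ratio. This is a minimal sketch rather than a definitive implementation: the function name trial_level_consistency, its parameters, the returned keys, and the fixed seed are assumptions made for illustration, and Levene’s test stands in for whichever variance test is chosen.

```python
import numpy as np
from scipy.stats import levene
from sklearn.model_selection import train_test_split


def trial_level_consistency(X, y, train_ratio, num_trials=20, alpha=0.05, seed=0):
    """Summarize variance agreement per trial for one split ratio (illustrative helper)."""
    # Treat the target as one more column so it is tested like the inputs (step 3).
    data = np.column_stack([X, y])
    rng = np.random.RandomState(seed)  # assumed fixed seed so the sketch is repeatable
    share_similar = []  # per trial: fraction of variables with p > alpha
    all_similar = 0     # number of trials where every variable passed

    for _ in range(num_trials):
        train, test = train_test_split(data, train_size=train_ratio, random_state=rng)
        passed = [
            levene(train[:, j], test[:, j]).pvalue > alpha
            for j in range(data.shape[1])
        ]
        share_similar.append(np.mean(passed))
        all_similar += int(all(passed))

    return {
        "mean_pct_similar": 100 * float(np.mean(share_similar)),  # step 5, first metric
        "trials_all_similar": all_similar,                        # step 5, second metric
    }


if __name__ == "__main__":
    # Synthetic data, used only to exercise the helper.
    demo_rng = np.random.default_rng(0)
    X_demo = demo_rng.normal(size=(500, 4))
    y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + demo_rng.normal(scale=0.1, size=500)
    for ratio in (0.5, 0.7, 0.8, 0.9):
        print(ratio, trial_level_consistency(X_demo, y_demo, ratio))
```

Stacking the target next to the features is one way to cover the "input/target" wording in step 3; testing y separately would work equally well.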

Code Example

Below is a Python function that evaluates train/test split ratios by comparing the variance consistency of numerical variables across splits.

import numpy as np
from sklearn.model_selection import train_test_split
from scipy.stats import levene

def split_variance_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05):
    """
    Perform variance sensitivity analysis for train/test splits across numerical variables.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array; split alongside X but not tested for variance here.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.7, 0.3), (0.8, 0.2)]).
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for Levene’s test.

    Returns:
        dict: Maps each split-ratio label to a dict giving, per variable, the percentage of trials with p-value > alpha.
    """
    n_features = X.shape[1]
    results = {}

    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = {f"Variable_{i}": [] for i in range(n_features)}

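        for _ in range(num_trials):
            # Draw a fresh random split; no random_state is set, so results vary between runs.
            X_train, X_test, _, _ = train_test_split(X, y, train_size=train_ratio, test_size=test_ratio)

            for i in range(n_features):
                # Levene's null hypothesis is equal variance, so p > alpha counts as "consistent".
                _, p_value = levene(X_train[:, i], X_test[:, i])
                results[ratio_key][f"Variable_{i}"].append(p_value > alpha)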

        for var in results[ratio_key]:
            results[ratio_key][var] = np.mean(results[ratio_key][var]) * 100  # Percentage of consistent splits

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Run the sensitivity analysis
    results = split_variance_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05)

    # Print Results
    for split, variables in results.items():
        print(f"Split Ratio {split}:")
        for var, percentage in variables.items():
            print(f"  - {var}: {percentage:.2f}% consistent splits")
        print("-" * 40)

Example Output

Split Ratio Train:0.5, Test:0.5:
  - Variable_0: 86.67% consistent splits
  - Variable_1: 100.00% consistent splits
  - Variable_2: 96.67% consistent splits
  - Variable_3: 93.33% consistent splits
  - Variable_4: 100.00% consistent splits
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
  - Variable_0: 96.67% consistent splits
  - Variable_1: 96.67% consistent splits
  - Variable_2: 90.00% consistent splits
  - Variable_3: 93.33% consistent splits
  - Variable_4: 93.33% consistent splits
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
  - Variable_0: 93.33% consistent splits
  - Variable_1: 96.67% consistent splits
  - Variable_2: 96.67% consistent splits
  - Variable_3: 100.00% consistent splits
  - Variable_4: 86.67% consistent splits
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
  - Variable_0: 96.67% consistent splits
  - Variable_1: 100.00% consistent splits
  - Variable_2: 96.67% consistent splits
  - Variable_3: 96.67% consistent splits
  - Variable_4: 96.67% consistent splits
----------------------------------------
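
When reading these percentages, it may help to compare them with what chance alone would produce. Both subsets are drawn at random from the same dataset, so a test run at alpha = 0.05 would be expected to flag roughly 5% of splits even when nothing is wrong. The sketch below estimates how far a pass rate can wander over 30 trials. It assumes the test holds its nominal level and that trials and variables are independent, which is only an approximation; the variable names are illustrative.

```python
import math
from scipy.stats import binom

alpha = 0.05      # significance level used in the demo
num_trials = 30   # trials per split ratio in the demo
n_features = 5    # variables tested in the demo

# Expected share of trials that pass when train and test truly share a variance.
expected_pass = 1 - alpha

# Binomial standard error of an observed pass rate over num_trials trials.
std_err = math.sqrt(expected_pass * (1 - expected_pass) / num_trials)

# Chance that a single variable passes in 26 or fewer of the 30 trials.
prob_26_or_fewer = binom.cdf(26, num_trials, expected_pass)

# Chance that all variables pass in one trial (assumes independent tests).
all_pass = expected_pass ** n_features

print(f"expected pass rate: {100 * expected_pass:.1f}%")
print(f"standard error over {num_trials} trials: {100 * std_err:.1f} percentage points")
print(f"P(pass count <= 26): {prob_26_or_fewer:.3f}")
print(f"expected share of trials where all variables pass: {100 * all_pass:.1f}%")
```

Under these assumptions, a value such as 86.67% (26 of 30) is within the range chance can produce when 20 variable-and-ratio combinations are reported, so small differences between split ratios in the output above should be read with caution.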