Split Sensitivity Analysis By Data Distribution

Procedure

  1. Define Split Ratios

    • Identify train/test split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
  2. Perform Random Splits

    • For each split ratio, generate multiple random train/test splits (e.g., 10–20 iterations per ratio) to account for variability in results.
  3. Compare Variable Distributions

    • Use a two-sample distribution test (e.g., the Kolmogorov-Smirnov test or the Anderson-Darling test) to compare each variable's distribution in the train set against its distribution in the test set, for every split.
  4. Record Consistency Metrics

    • For each split ratio and iteration, treat a variable's train and test distributions as similar when the test fails to reject the null hypothesis that both samples come from the same distribution (e.g., p-value > 0.05).
    • Calculate the percentage of variables that maintain similar distributions for each split ratio.
  5. Aggregate Results Across Iterations

    • Compute the average percentage of variables with similar distributions for each split ratio across all iterations.
    • Identify split ratios whose results are stable across iterations and show a high percentage of matching distributions.
  6. Report Findings

    • Summarize the distribution comparison results for each split ratio, including:
      • Average percentage of variables with similar distributions.
      • Number of iterations where all variables had matching distributions.
    • Highlight the split ratio(s) with the best performance based on these metrics.
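
Step 3 names two tests. The sketch below shows one way both could be applied to a single variable from a single split, assuming SciPy's `ks_2samp` and `anderson_ksamp`. The helper name `compare_feature` and the synthetic samples are hypothetical, chosen only for illustration.

```python
import warnings

import numpy as np
from scipy.stats import anderson_ksamp, ks_2samp


def compare_feature(train_col, test_col, alpha=0.05):
    """Return KS and Anderson-Darling p-values and whether each exceeds alpha."""
    # Two-sample Kolmogorov-Smirnov test; the p-value is the second element.
    _, ks_p = ks_2samp(train_col, test_col)

    # k-sample Anderson-Darling test. SciPy clips this p-value to the range
    # [0.001, 0.25] and warns when it does, so the warning is silenced here.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        _, _, ad_p = anderson_ksamp([train_col, test_col])

    return {
        "ks_p": float(ks_p),
        "ad_p": float(ad_p),
        "ks_similar": bool(ks_p > alpha),
        "ad_similar": bool(ad_p > alpha),
    }


# Hypothetical data: one pair drawn from the same distribution,
# one pair where the "test" sample is shifted by two standard deviations.
rng = np.random.default_rng(0)
same = compare_feature(rng.normal(size=700), rng.normal(size=300))
shifted = compare_feature(rng.normal(size=700), rng.normal(loc=2.0, size=300))
print(same)
print(shifted)
```

With the shifted pair, both tests should flag the difference. For the pair drawn from the same distribution, either test can still reject by chance at a rate of roughly alpha, which is why the procedure repeats the split several times per ratio.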

Code Example

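Below is a Python implementation of the diagnostic that evaluates train/test split ratios by comparing the distributions of numerical variables between the train and test sets. For each ratio it reports, per variable, the percentage of random splits in which the Kolmogorov-Smirnov test finds no significant difference.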

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

def split_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05):
    """
    Perform sensitivity analysis of train/test split sizes by comparing variable distributions.

    Parameters:
        X (np.ndarray): Feature matrix of shape (n_samples, n_features), numerical variables only.
        y (np.ndarray): Target array; split alongside X but not used in the comparison.
        split_ratios (list of tuples): (train, test) fractions, e.g., [(0.5, 0.5), (0.8, 0.2)].
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for the Kolmogorov-Smirnov test.

    Returns:
        dict: Maps each split-ratio label to a dict of per-variable percentages of
            trials in which the train and test distributions were not significantly
            different (p-value > alpha).
    """
    n_features = X.shape[1]
    results = {}

    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = {f"Variable_{i}": [] for i in range(n_features)}

        for _ in range(num_trials):
            X_train, X_test, _, _ = train_test_split(X, y, train_size=train_ratio, test_size=test_ratio)

            for i in range(n_features):
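                # Two-sample KS test on variable i; the statistic itself is not needed
                _, p_value = ks_2samp(X_train[:, i], X_test[:, i])
                # Record True when the test does not reject at level alpha
                results[ratio_key][f"Variable_{i}"].append(p_value > alpha)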

        for var in results[ratio_key]:
            results[ratio_key][var] = np.mean(results[ratio_key][var]) * 100  # Percentage of consistent trials

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Run the sensitivity analysis
    results = split_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05)

    # Print Results
    for split, variables in results.items():
        print(f"Split Ratio {split}:")
        for var, percentage in variables.items():
            print(f"  - {var}: {percentage:.2f}% consistent splits")
        print("-" * 40)

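Example Output

Because train_test_split is called without a fixed random_state, the exact percentages vary from run to run.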

Split Ratio Train:0.5, Test:0.5:
  - Variable_0: 90.00% consistent splits
  - Variable_1: 93.33% consistent splits
  - Variable_2: 93.33% consistent splits
  - Variable_3: 93.33% consistent splits
  - Variable_4: 96.67% consistent splits
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
  - Variable_0: 100.00% consistent splits
  - Variable_1: 100.00% consistent splits
  - Variable_2: 96.67% consistent splits
  - Variable_3: 100.00% consistent splits
  - Variable_4: 96.67% consistent splits
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
  - Variable_0: 96.67% consistent splits
  - Variable_1: 96.67% consistent splits
  - Variable_2: 96.67% consistent splits
  - Variable_3: 90.00% consistent splits
  - Variable_4: 90.00% consistent splits
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
  - Variable_0: 93.33% consistent splits
  - Variable_1: 96.67% consistent splits
  - Variable_2: 90.00% consistent splits
  - Variable_3: 93.33% consistent splits
  - Variable_4: 96.67% consistent splits
----------------------------------------
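
The function in the Code Example reports a pass rate per variable, whereas steps 4 to 6 of the procedure describe metrics per split ratio: the average percentage of variables with similar distributions and the number of iterations in which every variable matched. A minimal sketch of those metrics follows, reusing the same Kolmogorov-Smirnov check. The name `summarize_split_ratio`, the `seed` parameter, and the synthetic data are assumptions added for illustration and are not part of the original function.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split


def summarize_split_ratio(X, train_ratio, num_trials=30, alpha=0.05, seed=0):
    """Summarize one train/test ratio across repeated random splits.

    `summarize_split_ratio` and `seed` are hypothetical names used for
    illustration; they are not part of the original function.
    """
    n_features = X.shape[1]
    # similar[t, i] is True when feature i passes the KS check in trial t.
    similar = np.zeros((num_trials, n_features), dtype=bool)

    for t in range(num_trials):
        # A distinct seed per trial keeps the run reproducible.
        X_train, X_test = train_test_split(
            X, train_size=train_ratio, random_state=seed + t
        )
        for i in range(n_features):
            _, p_value = ks_2samp(X_train[:, i], X_test[:, i])
            similar[t, i] = p_value > alpha

    # Step 4: percentage of variables with similar distributions, per trial.
    pct_per_trial = similar.mean(axis=1) * 100
    return {
        # Step 5: average of that percentage across trials.
        "avg_pct_similar": float(pct_per_trial.mean()),
        # Step 6: number of trials in which every variable matched.
        "all_match_trials": int(similar.all(axis=1).sum()),
    }


rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))  # synthetic data, for illustration only
for ratio in (0.5, 0.7, 0.8, 0.9):
    print(ratio, summarize_split_ratio(X_demo, ratio, num_trials=10))
```

Storing the per-trial outcomes in a boolean matrix keeps both summaries one line each, and the same matrix could also yield the per-variable pass rates via `similar.mean(axis=0) * 100`.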