Split Sensitivity Analysis By Data Divergence

Procedure

  1. Define Split Ratios

    • Identify train/test split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
  2. Prepare Data for Divergence Analysis

    • Normalize all variables to a common scale (for example, [0, 1]) so that the same histogram bins can be used to estimate each variable's distribution in both the train and test sets.
    • Handle missing data and inconsistencies to avoid issues during calculations.
  3. Perform Random Splits

    • For each split ratio, generate multiple random train/test splits (e.g., 10–20 iterations per ratio) to account for variability in results.
  4. Calculate Divergence Measures

    • For each split and each variable:
      • Compute a divergence measure such as KL-divergence or JS-divergence to assess how the distribution of the variable differs between the train and test sets.
      • Record the divergence scores for all variables in each split.
  5. Analyze Low-Divergence Outcomes

    • Determine the percentage of variables in each split with low divergence scores (e.g., below a pre-defined threshold).
    • Record the percentage for each iteration and average the results for each split ratio.
  6. Aggregate and Compare Results

    • Aggregate the results across all iterations for each split ratio.
    • Compare the average percentage of low-divergence variables between split ratios to identify trends or patterns.
  7. Report Findings

    • Summarize the percentage of variables with low divergence for each split ratio and iteration.
    • Highlight the split ratio(s) that consistently show the lowest divergence and the highest percentage of variables with similar distributions.
    • Provide visualizations (e.g., bar charts or tables) to aid in interpretation of the results.
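
Step 4 names KL-divergence and JS-divergence without spelling them out. The sketch below is one possible way to compute both from histograms of a single, already-normalized variable, using NumPy only. The helper names (histogram_probs, kl_divergence, js_divergence), the 20-bin histogram, the 1e-8 smoothing constant, and the base-2 logarithm are illustrative assumptions, not part of the procedure itself.

```python
import numpy as np

def histogram_probs(values, bins=20, value_range=(0.0, 1.0), eps=1e-8):
    """Estimate a discrete distribution from samples via a smoothed histogram."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    probs = counts.astype(float) + eps  # smoothing keeps every bin strictly positive
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes p and q are strictly positive and sum to 1."""
    return float(np.sum(p * np.log2(p / q)))

def js_divergence(p, q):
    """JS divergence in bits: symmetric in p and q, bounded in [0, 1]."""
    m = 0.5 * (p + q)  # mixture of the two distributions
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical data: one variable already scaled to [0, 1].
rng = np.random.default_rng(0)
values = rng.uniform(0.0, 1.0, size=1000)
train, test = values[:800], values[800:]  # a single 80/20 split of i.i.d. samples

p = histogram_probs(train)
q = histogram_probs(test)
print(f"KL(train || test) = {kl_divergence(p, q):.4f}")
print(f"JS(train, test)   = {js_divergence(p, q):.4f}")
```

KL-divergence is asymmetric and unbounded, so the direction of comparison matters. JS-divergence is symmetric and bounded, which tends to make a single fixed threshold easier to interpret across variables.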

Code Example

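The following Python function implements a diagnostic that evaluates train/test split ratios by comparing the distributions of numerical variables in the train and test sets. It relies on SciPy's jensenshannon, which returns the Jensen-Shannon (JS) distance, the square root of the JS divergence, so the threshold applies to that distance. For each variable, the function reports the percentage of random splits whose score falls below the threshold.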

import numpy as np
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import jensenshannon
from sklearn.preprocessing import MinMaxScaler

def evaluate_split_divergence(X, y, split_ratios, num_trials=30, divergence_threshold=0.1):
    """
    Evaluate train/test split ratios by calculating the Jensen-Shannon divergence
    between distributions of numerical variables.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target variable array (not used in the analysis but included for completeness).
        split_ratios (list of tuples): Train/test split ratios to evaluate (e.g., [(0.5, 0.5), (0.7, 0.3)]).
        num_trials (int): Number of random splits to test for each ratio.
        divergence_threshold (float): Threshold below which divergence is considered low.

    Returns:
        dict: Maps each split-ratio label to a dict giving, for each variable, the percentage of trials with low divergence.
    """
    n_features = X.shape[1]
    results = {}

    # Normalize the dataset to [0, 1] range for consistency
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)

    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = {f"Variable_{i}": [] for i in range(n_features)}

        for _ in range(num_trials):
            # Create train/test split
            X_train, X_test, _, _ = train_test_split(X, y, train_size=train_ratio, test_size=test_ratio)

            # Compare distributions for each variable using JS divergence
            for i in range(n_features):
                # Compute histograms for train and test distributions
                train_hist, _ = np.histogram(X_train[:, i], bins=20, range=(0, 1), density=True)
                test_hist, _ = np.histogram(X_test[:, i], bins=20, range=(0, 1), density=True)

                # Add a small constant so that no bin has zero probability
                train_hist += 1e-8
                test_hist += 1e-8

                # Normalize histograms to probabilities
                train_hist /= np.sum(train_hist)
                test_hist /= np.sum(test_hist)

                # Calculate the JS distance; SciPy's jensenshannon returns the
                # square root of the JS divergence, not the divergence itself
                js_div = jensenshannon(train_hist, test_hist)

                # Record whether divergence is below threshold
                results[ratio_key][f"Variable_{i}"].append(js_div < divergence_threshold)

        # Calculate percentage of low-divergence splits per variable
        for var in results[ratio_key]:
            results[ratio_key][var] = np.mean(results[ratio_key][var]) * 100

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate a synthetic dataset
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Define train/test split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Perform divergence analysis
    results = evaluate_split_divergence(X, y, split_ratios, num_trials=30, divergence_threshold=0.1)

    # Print results
    for split, variables in results.items():
        print(f"Split Ratio {split}:")
        for var, percentage in variables.items():
            print(f"  - {var}: {percentage:.2f}% low-divergence splits")
        print("-" * 40)

Example Output

Split Ratio Train:0.5, Test:0.5:
  - Variable_0: 26.67% low-divergence splits
  - Variable_1: 43.33% low-divergence splits
  - Variable_2: 66.67% low-divergence splits
  - Variable_3: 76.67% low-divergence splits
  - Variable_4: 43.33% low-divergence splits
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
  - Variable_0: 13.33% low-divergence splits
  - Variable_1: 13.33% low-divergence splits
  - Variable_2: 36.67% low-divergence splits
  - Variable_3: 53.33% low-divergence splits
  - Variable_4: 43.33% low-divergence splits
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
  - Variable_0: 0.00% low-divergence splits
  - Variable_1: 10.00% low-divergence splits
  - Variable_2: 6.67% low-divergence splits
  - Variable_3: 6.67% low-divergence splits
  - Variable_4: 6.67% low-divergence splits
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
  - Variable_0: 0.00% low-divergence splits
  - Variable_1: 0.00% low-divergence splits
  - Variable_2: 0.00% low-divergence splits
  - Variable_3: 0.00% low-divergence splits
  - Variable_4: 0.00% low-divergence splits
----------------------------------------
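
To compare split ratios as in steps 5 and 6, the per-variable percentages can be collapsed into one number per ratio. The sketch below is a minimal illustration: summarize_by_ratio is a hypothetical helper, and the input values are copied from the example output above. Because every variable is scored over the same number of trials, the plain mean across variables equals the average per-split percentage of low-divergence variables.

```python
def summarize_by_ratio(results):
    """Average the per-variable low-divergence percentages for each split ratio."""
    return {
        ratio: sum(per_variable.values()) / len(per_variable)
        for ratio, per_variable in results.items()
    }

# Percentages copied from the example output above.
example_results = {
    "Train:0.5, Test:0.5": {"Variable_0": 26.67, "Variable_1": 43.33,
                            "Variable_2": 66.67, "Variable_3": 76.67,
                            "Variable_4": 43.33},
    "Train:0.7, Test:0.3": {"Variable_0": 13.33, "Variable_1": 13.33,
                            "Variable_2": 36.67, "Variable_3": 53.33,
                            "Variable_4": 43.33},
    "Train:0.8, Test:0.2": {"Variable_0": 0.00, "Variable_1": 10.00,
                            "Variable_2": 6.67, "Variable_3": 6.67,
                            "Variable_4": 6.67},
    "Train:0.9, Test:0.1": {"Variable_0": 0.00, "Variable_1": 0.00,
                            "Variable_2": 0.00, "Variable_3": 0.00,
                            "Variable_4": 0.00},
}

summary = summarize_by_ratio(example_results)
best_ratio = max(summary, key=summary.get)  # ratio with the highest average

for ratio, average in summary.items():
    print(f"{ratio}: {average:.2f}% low-divergence on average")
print(f"Highest average: {best_ratio}")
```

The same helper can be applied directly to the dictionary returned by evaluate_split_divergence.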