Split Sensitivity Analysis By Data Central Tendency

We can perform a sensitivity analysis of train/test split sizes by comparing, for each variable, its distribution in the train set with its distribution in the test set.

Statistical hypothesis tests can be used to evaluate how well different train/test split ratios preserve the central tendencies and distributions of numerical variables, ensuring that the training and testing sets represent the dataset consistently.

Procedure

  1. Define Split Ratios: Evaluate various train/test split ratios ranging from 50/50 to 90/10.

  2. Normality Check:

    • Perform a normality test (e.g., Shapiro-Wilk, Kolmogorov-Smirnov) on the whole dataset for each variable.
    • Based on the result, classify variables as normally or non-normally distributed.
  3. Statistical Test Selection:

    • For normally distributed variables, use parametric tests (e.g., t-test) to compare the central tendencies between the train and test sets.
    • For non-normally distributed variables, use non-parametric tests (e.g., Mann-Whitney U test).
  4. Random Splitting:

    • For each split ratio, create multiple random train/test splits to account for variability.
    • Apply the selected statistical tests to each variable in each split.
  5. Assessment Criteria:

    • Determine whether the central tendencies of the variables in the train and test sets match. A variable is treated as matching when the test does not reject the null hypothesis at the chosen significance level (e.g., p-value above 0.05).
    • Track:
      • The percentage of variables with matching central tendencies.
      • Whether all or none of the variables match for a given split.
  6. Evaluation of Split Quality:

    • A “good” split ratio keeps a high percentage of variables with matching central tendencies across train and test sets.
    • Consistency in distribution matching across multiple random splits of the same ratio is also evaluated.
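
Steps 5 and 6 also call for per-split tracking: the share of variables that match in each individual split, and whether all or none of them do. The following is a minimal sketch of that bookkeeping, assuming the same test-selection rule as in the procedure above. The function name `per_split_summary`, its parameters (including `seed`), the output column names, and the two-column demo dataset are all hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from sklearn.model_selection import train_test_split


def per_split_summary(dataset, train_size, num_trials=30, alpha=0.05, seed=0):
    """Hypothetical helper: return one row per random split of a given ratio."""
    # Classify each variable once, on the whole dataset (step 2).
    is_normal = {col: shapiro(dataset[col]).pvalue > alpha for col in dataset.columns}

    rows = []
    for trial in range(num_trials):
        # Use a different but reproducible seed for every trial.
        train, test = train_test_split(
            dataset, train_size=train_size, random_state=seed + trial
        )
        matches = []
        for col in dataset.columns:
            if is_normal[col]:
                p_value = ttest_ind(train[col], test[col], equal_var=False).pvalue
            else:
                p_value = mannwhitneyu(
                    train[col], test[col], alternative="two-sided"
                ).pvalue
            # A variable "matches" when the test does not reject at level alpha.
            matches.append(bool(p_value > alpha))
        rows.append({
            "Trial": trial,
            "Matching %": 100.0 * np.mean(matches),
            "All Match": all(matches),
            "None Match": not any(matches),
        })
    return pd.DataFrame(rows)


# Illustrative data only: one roughly normal and one skewed variable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gaussian": rng.normal(size=300),
    "skewed": rng.exponential(size=300),
})
per_split = per_split_summary(demo, train_size=0.8, num_trials=10)
print(per_split)
print("Share of splits where every variable matches:", per_split["All Match"].mean())
```

The mean of the `All Match` column gives a single consistency figure per ratio, which can be compared across ratios in the same way as the per-variable percentages.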

Code Example

Below is a Python implementation of the described sensitivity analysis. The function evaluates train/test split ratios, assesses the distributional similarity of each variable, and reports the results.

import numpy as np
import pandas as pd
from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def sensitivity_analysis(dataset, split_ratios, num_trials=30, alpha=0.05):
    """
    Perform sensitivity analysis of train/test split sizes on numerical variables.

    Parameters:
        dataset (pd.DataFrame): Dataset with numerical variables.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for statistical tests.

    Returns:
        pd.DataFrame: Results showing the percentage of representative splits for each variable and ratio.
    """
    # Initialize result storage
    results = {ratio: {col: [] for col in dataset.columns} for ratio in split_ratios}

    # Classify each variable: True if Shapiro-Wilk does not reject normality at level alpha
    normality = {col: shapiro(dataset[col])[1] > alpha for col in dataset.columns}

    # Perform sensitivity analysis
    for ratio in split_ratios:
        train_size, test_size = ratio
        for _ in range(num_trials):
            train, test = train_test_split(dataset, train_size=train_size, test_size=test_size, random_state=None)

            for col in dataset.columns:
                if normality[col]:  # Use t-test for normally distributed variables
                    stat, p_value = ttest_ind(train[col], test[col], equal_var=False)
                else:  # Use Mann-Whitney U test for non-normally distributed variables
                    stat, p_value = mannwhitneyu(train[col], test[col], alternative="two-sided")

                # Record whether the test fails to reject equality (p-value above alpha)
                results[ratio][col].append(p_value > alpha)

    # Summarize results
    summary = []
    for ratio, variables in results.items():
        for col, outcomes in variables.items():
            representative_percentage = np.mean(outcomes) * 100
            summary.append({"Split Ratio": ratio,
                            "Variable": col,
                            "Normal": normality[col],
                            "Representative %": representative_percentage})

    return pd.DataFrame(summary)

def print_results_nicely(results_df):
    """
    Prints the results of sensitivity analysis in a formatted way.

    Parameters:
        results_df (pd.DataFrame): The results dataframe with columns 'Split Ratio', 'Variable',
                                   'Normal', and 'Representative %'.
    """
    grouped = results_df.groupby('Split Ratio')

    for split_ratio, group in grouped:
        print(f"Split Ratio {split_ratio}:")
        for _, row in group.iterrows():
            print(f"  - {row['Variable']}: (normal: {str(row['Normal']):5s}) {row['Representative %']:.2f}% representative splits")
        print("-" * 40)  # Separator for readability

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4, n_redundant=1, random_state=42)
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(X.shape[1])])

# Define split ratios to test
split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

# Perform sensitivity analysis
results = sensitivity_analysis(df, split_ratios, num_trials=30)

# Print the formatted results
print_results_nicely(results)

Example Output

Split Ratio (0.5, 0.5):
  - Feature_0: (normal: False) 86.67% representative splits
  - Feature_1: (normal: True ) 96.67% representative splits
  - Feature_2: (normal: False) 80.00% representative splits
  - Feature_3: (normal: False) 86.67% representative splits
  - Feature_4: (normal: False) 90.00% representative splits
----------------------------------------
Split Ratio (0.7, 0.3):
  - Feature_0: (normal: False) 93.33% representative splits
  - Feature_1: (normal: True ) 96.67% representative splits
  - Feature_2: (normal: False) 100.00% representative splits
  - Feature_3: (normal: False) 100.00% representative splits
  - Feature_4: (normal: False) 100.00% representative splits
----------------------------------------
Split Ratio (0.8, 0.2):
  - Feature_0: (normal: False) 96.67% representative splits
  - Feature_1: (normal: True ) 90.00% representative splits
  - Feature_2: (normal: False) 93.33% representative splits
  - Feature_3: (normal: False) 90.00% representative splits
  - Feature_4: (normal: False) 96.67% representative splits
----------------------------------------
Split Ratio (0.9, 0.1):
  - Feature_0: (normal: False) 93.33% representative splits
  - Feature_1: (normal: True ) 93.33% representative splits
  - Feature_2: (normal: False) 93.33% representative splits
  - Feature_3: (normal: False) 86.67% representative splits
  - Feature_4: (normal: False) 96.67% representative splits
----------------------------------------
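
To compare ratios at a glance, the per-variable percentages can be aggregated per split ratio. The sketch below shows one way to do that, assuming a results frame with the same columns that `sensitivity_analysis` returns. The helper name `summarize_by_ratio` and the string labels it builds are hypothetical. The demo frame reuses the figures for the first two split ratios from the example output above.

```python
import pandas as pd


def summarize_by_ratio(results_df):
    """Hypothetical helper: aggregate 'Representative %' per split ratio."""
    # Turn ratio tuples into labels such as "0.5/0.5" so they make simple group keys.
    labels = results_df["Split Ratio"].map(lambda r: f"{r[0]}/{r[1]}")
    return (
        results_df.assign(**{"Split Ratio": labels})
        .groupby("Split Ratio")["Representative %"]
        .agg(["mean", "min", "max"])
    )


# Values copied from the example output for the 50/50 and 70/30 splits.
demo_results = pd.DataFrame({
    "Split Ratio": [(0.5, 0.5)] * 5 + [(0.7, 0.3)] * 5,
    "Variable": [f"Feature_{i}" for i in range(5)] * 2,
    "Representative %": [86.67, 96.67, 80.00, 86.67, 90.00,
                         93.33, 96.67, 100.00, 100.00, 100.00],
})
ratio_summary = summarize_by_ratio(demo_results)
print(ratio_summary)
```

When reading these figures, keep in mind that with alpha = 0.05 a well-calibrated test would be expected to reject in roughly 5% of purely random splits, so percentages near 95% are about what chance alone produces. With 30 trials, a single split changes a percentage by about 3.33 points, so small differences between ratios may simply be sampling noise.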