Split Sensitivity Analysis By Data Central Tendency
We can perform a sensitivity analysis of train/test split sizes by comparing the distribution of each variable between the train and test sets.
Statistical hypothesis tests can be used to evaluate how well different train/test split ratios preserve the central tendencies and distributions of numerical variables, ensuring that the training and testing sets represent the dataset consistently.
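As a minimal sketch of the idea, the snippet below compares one variable across a single split. The log-normal sample, the 80/20 ratio, and the fixed seeds are assumptions made purely for illustration. The two-sample Kolmogorov-Smirnov test (`ks_2samp`) is included here as one possible whole-distribution check; it is not part of the procedure that follows, which compares central tendencies only.

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu
from sklearn.model_selection import train_test_split

# Hypothetical single variable: 1,000 draws from a skewed (log-normal) distribution.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# One 80/20 split; random_state is fixed only to make the sketch repeatable.
train, test = train_test_split(values, test_size=0.2, random_state=0)

# Mann-Whitney U: tests for a location shift between the two samples.
_, p_location = mannwhitneyu(train, test, alternative="two-sided")

# Two-sample Kolmogorov-Smirnov: compares the full empirical distributions.
_, p_shape = ks_2samp(train, test)

print(f"Mann-Whitney U p-value:     {p_location:.3f}")
print(f"Kolmogorov-Smirnov p-value: {p_shape:.3f}")
```

A p-value above the chosen threshold means the test found no evidence of a difference; it does not prove that the two sets are identical.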
Procedure
- Define Split Ratios: Evaluate various train/test split ratios ranging from 50/50 to 90/10.
- Normality Check:
  - For each variable, run a normality test (e.g., Shapiro-Wilk or Kolmogorov-Smirnov) on the whole dataset.
  - Based on the result, classify each variable as normally or non-normally distributed.
- Statistical Test Selection:
  - For normally distributed variables, use a parametric test (e.g., t-test) to compare the central tendencies of the train and test sets.
  - For non-normally distributed variables, use a non-parametric test (e.g., Mann-Whitney U test).
- Random Splitting:
  - For each split ratio, create multiple random train/test splits to account for variability.
  - Apply the selected statistical test to each variable in each split.
- Assessment Criteria:
  - Treat a variable as matching between the train and test sets when the test does not reject the null hypothesis of equal central tendency at the chosen significance level (e.g., p-value above 0.05).
  - Track:
    - The percentage of variables with matching distributions.
    - Whether all or none of the variables match for a given split.
- Evaluation of Split Quality:
  - A “good” split maintains a high percentage of variables with matching distributions across train and test sets.
  - Consistency of matching across multiple random splits of the same ratio is also evaluated.
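The per-split tracking described under Assessment Criteria (share of variables that match, and whether all or none match) can be sketched as follows. This is an illustrative assumption rather than the article's implementation: the function name `split_level_summary`, the `seed` parameter, and the synthetic normal data are hypothetical, and for brevity the Mann-Whitney U test is applied to every variable instead of selecting a test per variable.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.model_selection import train_test_split


def split_level_summary(dataset, test_size, num_trials=30, alpha=0.05, seed=0):
    """Summarize, per random split, how many variables pass the comparison test."""
    fractions = []
    for trial in range(num_trials):
        train, test = train_test_split(
            dataset, test_size=test_size, random_state=seed + trial
        )
        # True where the test does not reject equality for that variable.
        passed = [
            mannwhitneyu(train[col], test[col], alternative="two-sided").pvalue > alpha
            for col in dataset.columns
        ]
        fractions.append(np.mean(passed))
    fractions = np.array(fractions)
    return {
        "mean % matching": 100 * fractions.mean(),
        "% splits all match": 100 * np.mean(fractions == 1.0),
        "% splits none match": 100 * np.mean(fractions == 0.0),
    }


# Hypothetical data: four independent standard-normal variables.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("ABCD"))
summary = split_level_summary(demo, test_size=0.2)
print(summary)
```

Fixing `random_state` per trial keeps the sketch repeatable; dropping the seed would give a fresh set of splits on every run.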
Code Example
Below is a Python implementation of the described sensitivity analysis. The function evaluates train/test split ratios, assesses the distributional similarity of each variable, and reports the results.
import numpy as np
import pandas as pd
from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def sensitivity_analysis(dataset, split_ratios, num_trials=30, alpha=0.05):
    """
    Perform sensitivity analysis of train/test split sizes on numerical variables.

    Parameters:
        dataset (pd.DataFrame): Dataset with numerical variables.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for statistical tests.

    Returns:
        pd.DataFrame: Results showing the percentage of representative splits for each variable and ratio.
    """
    # Initialize result storage: one list of pass/fail outcomes per ratio and variable
    results = {ratio: {col: [] for col in dataset.columns} for ratio in split_ratios}

    # Determine normality of each variable (Shapiro-Wilk on the whole dataset)
    normality = {col: shapiro(dataset[col]).pvalue > alpha for col in dataset.columns}

    # Perform sensitivity analysis
    for ratio in split_ratios:
        train_size, test_size = ratio
        for _ in range(num_trials):
            # random_state=None gives a different random split on every trial
            train, test = train_test_split(dataset, train_size=train_size, test_size=test_size, random_state=None)
            for col in dataset.columns:
                if normality[col]:  # Use t-test (Welch's variant) for normally distributed variables
                    _, p_value = ttest_ind(train[col], test[col], equal_var=False)
                else:  # Use Mann-Whitney U test for non-normally distributed variables
                    _, p_value = mannwhitneyu(train[col], test[col], alternative="two-sided")
                # Record whether the test failed to reject equality (p-value above alpha)
                results[ratio][col].append(p_value > alpha)

    # Summarize results
    summary = []
    for ratio, variables in results.items():
        for col, outcomes in variables.items():
            representative_percentage = np.mean(outcomes) * 100
            summary.append({"Split Ratio": ratio,
                            "Variable": col,
                            "Normal": normality[col],
                            "Representative %": representative_percentage})
    return pd.DataFrame(summary)


def print_results_nicely(results_df):
    """
    Prints the results of sensitivity analysis in a formatted way.

    Parameters:
        results_df (pd.DataFrame): The results dataframe with columns 'Split Ratio', 'Variable',
            'Normal', and 'Representative %'.
    """
    grouped = results_df.groupby('Split Ratio')
    for split_ratio, group in grouped:
        print(f"Split Ratio {split_ratio}:")
        for _, row in group.iterrows():
            # Single quotes inside the f-string keep this valid on Python versions before 3.12
            print(f" - {row['Variable']}: (normal: {str(row['Normal']):5s}) {row['Representative %']:.2f}% representative splits")
        print("-" * 40)  # Separator for readability


# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4, n_redundant=1, random_state=42)
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(X.shape[1])])

# Define split ratios to test
split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

# Perform sensitivity analysis
results = sensitivity_analysis(df, split_ratios, num_trials=30)

# Print the formatted results
print_results_nicely(results)
Example Output
Split Ratio (0.5, 0.5):
 - Feature_0: (normal: False) 86.67% representative splits
 - Feature_1: (normal: True ) 96.67% representative splits
 - Feature_2: (normal: False) 80.00% representative splits
 - Feature_3: (normal: False) 86.67% representative splits
 - Feature_4: (normal: False) 90.00% representative splits
----------------------------------------
Split Ratio (0.7, 0.3):
 - Feature_0: (normal: False) 93.33% representative splits
 - Feature_1: (normal: True ) 96.67% representative splits
 - Feature_2: (normal: False) 100.00% representative splits
 - Feature_3: (normal: False) 100.00% representative splits
 - Feature_4: (normal: False) 100.00% representative splits
----------------------------------------
Split Ratio (0.8, 0.2):
 - Feature_0: (normal: False) 96.67% representative splits
 - Feature_1: (normal: True ) 90.00% representative splits
 - Feature_2: (normal: False) 93.33% representative splits
 - Feature_3: (normal: False) 90.00% representative splits
 - Feature_4: (normal: False) 96.67% representative splits
----------------------------------------
Split Ratio (0.9, 0.1):
 - Feature_0: (normal: False) 93.33% representative splits
 - Feature_1: (normal: True ) 93.33% representative splits
 - Feature_2: (normal: False) 93.33% representative splits
 - Feature_3: (normal: False) 86.67% representative splits
 - Feature_4: (normal: False) 96.67% representative splits
----------------------------------------
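When reading these numbers, it may help to keep a rough baseline in mind. The splits are unseeded, so the exact values change from run to run. In addition, a test at significance level alpha rejects a true null hypothesis about alpha of the time, so a purely random split would be expected to pass roughly (1 - alpha) of the trials regardless of the ratio. The short calculation below is a back-of-the-envelope sketch under the assumption that each trial is an independent pass/fail draw with that probability.

```python
import math

alpha = 0.05      # significance level used in the analysis
num_trials = 30   # random splits per ratio

# Expected share of trials that pass when train and test truly match.
expected_pass_rate = 1 - alpha

# Binomial standard error of the observed pass rate over num_trials splits.
standard_error = math.sqrt(expected_pass_rate * alpha / num_trials)

print(f"Expected pass rate: {expected_pass_rate:.1%}")
print(f"Standard error over {num_trials} trials: {standard_error:.1%}")
```

With 30 trials, each individual split moves the reported percentage by about 3.33 points, so differences of a few percentage points between ratios may reflect sampling noise rather than a real effect of the split ratio. Increasing `num_trials` would narrow that uncertainty.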