Split Sensitivity Analysis by Data Divergence
Procedure
- Define Split Ratios
  - Identify the train/test split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
- Prepare Data for Divergence Analysis
  - Normalize all variables in the dataset so they share a common scale, since histogram-based divergence measures are sensitive to inconsistent scaling.
  - Handle missing data and inconsistencies before computing divergences (a preprocessing sketch follows this step).
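A minimal preprocessing sketch, assuming a NumPy feature matrix that may contain NaNs; median imputation and min-max scaling are assumptions made here for illustration, not part of the procedure, so substitute whatever suits the data:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def prepare_features(X):
    """Impute missing values, then rescale every variable to [0, 1]."""
    X = SimpleImputer(strategy="median").fit_transform(X)  # fill NaNs column-wise
    return MinMaxScaler().fit_transform(X)  # put all variables on one scale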
- Perform Random Splits
  - For each split ratio, generate multiple random train/test splits (e.g., 10–20 iterations per ratio) to account for variability in the results (see the loop sketch below).
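A sketch of the repeated-split loop; seeding each trial with its index is an assumption made here so the whole experiment can be re-run exactly:

from sklearn.model_selection import train_test_split

def repeated_splits(X, y, train_ratio, num_trials=20):
    """Yield num_trials independent random train/test splits of X and y."""
    for seed in range(num_trials):
        # Each trial gets its own random_state, so splits are independent but reproducible
        yield train_test_split(X, y, train_size=train_ratio, random_state=seed)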
- Calculate Divergence Measures
  - For each split and each variable:
    - Compute a divergence measure such as KL-divergence or JS-divergence to quantify how the variable's distribution differs between the train and test sets (a single-variable sketch follows this step).
    - Record the divergence scores for all variables in each split.
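A single-variable sketch covering both measures named above. Note the SciPy details: scipy.stats.entropy(p, q) computes the asymmetric KL-divergence, while scipy.spatial.distance.jensenshannon returns the symmetric JS distance, i.e., the square root of the JS divergence. Values are assumed to be pre-scaled to [0, 1]:

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

def variable_divergence(train_col, test_col, bins=20):
    """Histogram one variable in the train and test sets, then compare the two."""
    p, _ = np.histogram(train_col, bins=bins, range=(0, 1), density=True)
    q, _ = np.histogram(test_col, bins=bins, range=(0, 1), density=True)
    p = (p + 1e-8) / np.sum(p + 1e-8)  # smooth empty bins, normalize to probabilities
    q = (q + 1e-8) / np.sum(q + 1e-8)
    return entropy(p, q), jensenshannon(p, q)  # (KL-divergence, JS distance)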
- Analyze Low-Divergence Outcomes
  - Determine the percentage of variables in each split with low divergence scores (e.g., below a predefined threshold); the one-line computation is sketched after this step.
  - Record the percentage for each iteration and average the results for each split ratio.
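The thresholding itself reduces to one line over the recorded scores; a sketch, assuming divergences holds one score per variable for a given split:

import numpy as np

def low_divergence_pct(divergences, threshold=0.1):
    """Percentage of divergence scores that fall below the threshold."""
    return float(np.mean(np.asarray(divergences) < threshold) * 100)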
- Aggregate and Compare Results
  - Aggregate the results across all iterations for each split ratio.
  - Compare the average percentage of low-divergence variables across split ratios to identify trends or patterns (see the aggregation sketch below).
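One way to aggregate, assuming the nested {ratio: {variable: percentage}} dictionary produced by evaluate_split_divergence in the Code Example below; the use of pandas is an assumption made here:

import pandas as pd

def summarize_results(results):
    """Tabulate per-variable percentages and add a per-ratio average."""
    df = pd.DataFrame(results).T  # rows: split ratios, columns: variables
    df["mean_low_divergence_pct"] = df.mean(axis=1)
    return df.sort_values("mean_low_divergence_pct", ascending=False)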
- Report Findings
  - Summarize the percentage of variables with low divergence for each split ratio and iteration.
  - Highlight the split ratio(s) that consistently show the lowest divergence and the highest percentage of variables with similar distributions.
  - Provide visualizations (e.g., bar charts or tables) to aid interpretation of the results; a plotting sketch follows this list.
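A minimal plotting sketch for the summary table produced above; matplotlib and the mean_low_divergence_pct column name are assumptions carried over from the summarize_results sketch:

import matplotlib.pyplot as plt

def plot_low_divergence(summary_df):
    """Bar chart of the average low-divergence percentage per split ratio."""
    ax = summary_df["mean_low_divergence_pct"].plot.bar(rot=0)
    ax.set_xlabel("Train/test split ratio")
    ax.set_ylabel("Variables with low divergence (%)")
    plt.tight_layout()
    plt.show()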
Code Example
The following Python function implements a diagnostic test that evaluates train/test split ratios by measuring how far the distribution of each numerical variable in the train set diverges from its distribution in the test set. It uses SciPy's jensenshannon, which returns the Jensen-Shannon (JS) distance (the square root of the JS divergence); that distance is what is compared against the threshold.
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import jensenshannon
from sklearn.preprocessing import MinMaxScaler
def evaluate_split_divergence(X, y, split_ratios, num_trials=30, divergence_threshold=0.1):
"""
Evaluate train/test split ratios by calculating the Jensen-Shannon divergence
between distributions of numerical variables.
Parameters:
X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array; passed to train_test_split but not otherwise analyzed.
split_ratios (list of tuples): Train/test split ratios to evaluate (e.g., [(0.5, 0.5), (0.7, 0.3)]).
num_trials (int): Number of random splits to test for each ratio.
divergence_threshold (float): Threshold below which divergence is considered low.
Returns:
        dict: Maps each split-ratio label to the percentage of trials, per variable, in which divergence fell below the threshold.
"""
n_features = X.shape[1]
results = {}
# Normalize the dataset to [0, 1] range for consistency
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
for train_ratio, test_ratio in split_ratios:
ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
results[ratio_key] = {f"Variable_{i}": [] for i in range(n_features)}
for _ in range(num_trials):
# Create train/test split
X_train, X_test, _, _ = train_test_split(X, y, train_size=train_ratio, test_size=test_ratio)
# Compare distributions for each variable using JS divergence
for i in range(n_features):
# Compute histograms for train and test distributions
train_hist, _ = np.histogram(X_train[:, i], bins=20, range=(0, 1), density=True)
test_hist, _ = np.histogram(X_test[:, i], bins=20, range=(0, 1), density=True)
                # Smooth empty bins so zero probabilities don't break the divergence
train_hist += 1e-8
test_hist += 1e-8
# Normalize histograms to probabilities
train_hist /= np.sum(train_hist)
test_hist /= np.sum(test_hist)
                # JS distance (SciPy returns the square root of the JS divergence)
js_div = jensenshannon(train_hist, test_hist)
# Record whether divergence is below threshold
results[ratio_key][f"Variable_{i}"].append(js_div < divergence_threshold)
# Calculate percentage of low-divergence splits per variable
for var in results[ratio_key]:
results[ratio_key][var] = np.mean(results[ratio_key][var]) * 100
return results
# Demo Usage
if __name__ == "__main__":
# Generate a synthetic dataset
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
# Define train/test split ratios to evaluate
split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]
# Perform divergence analysis
results = evaluate_split_divergence(X, y, split_ratios, num_trials=30, divergence_threshold=0.1)
# Print results
for split, variables in results.items():
print(f"Split Ratio {split}:")
for var, percentage in variables.items():
print(f" - {var}: {percentage:.2f}% low-divergence splits")
print("-" * 40)
Example Output
Split Ratio Train:0.5, Test:0.5:
- Variable_0: 26.67% low-divergence splits
- Variable_1: 43.33% low-divergence splits
- Variable_2: 66.67% low-divergence splits
- Variable_3: 76.67% low-divergence splits
- Variable_4: 43.33% low-divergence splits
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- Variable_0: 13.33% low-divergence splits
- Variable_1: 13.33% low-divergence splits
- Variable_2: 36.67% low-divergence splits
- Variable_3: 53.33% low-divergence splits
- Variable_4: 43.33% low-divergence splits
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- Variable_0: 0.00% low-divergence splits
- Variable_1: 10.00% low-divergence splits
- Variable_2: 6.67% low-divergence splits
- Variable_3: 6.67% low-divergence splits
- Variable_4: 6.67% low-divergence splits
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
- Variable_0: 0.00% low-divergence splits
- Variable_1: 0.00% low-divergence splits
- Variable_2: 0.00% low-divergence splits
- Variable_3: 0.00% low-divergence splits
- Variable_4: 0.00% low-divergence splits
----------------------------------------
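One plausible reading of this run: the 50/50 split leaves the largest test set, so its per-variable histograms are the least noisy and most often fall below the divergence threshold, while the 90/10 split's small test set yields noisy histograms and no low-divergence trials at all.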