Split Sensitivity Analysis By Data Distribution
Procedure
Define Split Ratios
- Identify train/test split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
Perform Random Splits
- For each split ratio, generate multiple random train/test splits (e.g., 10–20 iterations per ratio) to account for variability in results.
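One way to generate the repeated splits reproducibly is scikit-learn's `ShuffleSplit`, which yields index arrays for a fixed number of random splits. The following is a minimal sketch, assuming a NumPy feature matrix; the helper name `repeated_splits`, the seed, and the toy data are illustrative assumptions rather than part of the procedure itself.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit


def repeated_splits(X, test_size, n_iterations=10, seed=0):
    """Yield (X_train, X_test) pairs for n_iterations random splits (hypothetical helper)."""
    splitter = ShuffleSplit(n_splits=n_iterations, test_size=test_size, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        yield X[train_idx], X[test_idx]


# Toy data: 200 rows, 3 numerical features (assumed for illustration)
X_demo = np.random.default_rng(0).normal(size=(200, 3))
splits_80_20 = list(repeated_splits(X_demo, test_size=0.2, n_iterations=10))
print(len(splits_80_20), splits_80_20[0][0].shape, splits_80_20[0][1].shape)
```

Fixing the seed makes the set of splits repeatable, so a later run of the analysis can be compared against an earlier one.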
Compare Variable Distributions
- Use distribution comparison tests (e.g., Kolmogorov-Smirnov Test or Anderson-Darling Test) to evaluate the similarity of sample distributions for each variable between the train and test sets in each split.
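To illustrate what the comparison test reports for a single variable, the sketch below applies SciPy's two-sample Kolmogorov-Smirnov test (`ks_2samp`) to toy data: once with both samples drawn from the same distribution, and once with the test sample shifted. The sample sizes, seed, and shift are assumptions chosen for illustration. SciPy also provides `anderson_ksamp` for the k-sample Anderson-Darling test, which is not shown here.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Toy "train" sample and two candidate "test" samples for one variable
train_sample = rng.normal(loc=0.0, scale=1.0, size=700)
test_same = rng.normal(loc=0.0, scale=1.0, size=300)     # same distribution
test_shifted = rng.normal(loc=1.0, scale=1.0, size=300)  # mean shifted by 1

same = ks_2samp(train_sample, test_same)
shifted = ks_2samp(train_sample, test_shifted)

print(f"same distribution:    D={same.statistic:.3f}, p={same.pvalue:.3g}")
print(f"shifted distribution: D={shifted.statistic:.3f}, p={shifted.pvalue:.3g}")
```

A small p-value indicates that the two samples are unlikely to come from the same distribution; a large p-value only means the test found no evidence of a difference.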
Record Consistency Metrics
- For each split ratio and iteration, test whether the train and test distributions of each variable differ significantly, and treat a variable as consistent when the test does not reject equality (e.g., p-value > 0.05).
- Calculate the percentage of variables that maintain similar distributions for each split ratio.
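The per-split metric described above can be sketched as a small function that runs the KS test on every column and returns the share of columns with p > alpha. The function name `pct_similar_variables` and the toy data are assumptions made for this example.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split


def pct_similar_variables(X_train, X_test, alpha=0.05):
    """Percentage of columns whose KS test does not reject equality (p > alpha)."""
    p_values = np.array([
        ks_2samp(X_train[:, i], X_test[:, i]).pvalue
        for i in range(X_train.shape[1])
    ])
    return float(np.mean(p_values > alpha) * 100)


# Toy data: 500 rows, 4 numerical features (assumed for illustration)
X_demo = np.random.default_rng(1).normal(size=(500, 4))
X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=1)
print(f"{pct_similar_variables(X_train, X_test):.1f}% of variables look similar")
```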
Aggregate Results Across Iterations
- Compute the average percentage of variables with similar distributions for each split ratio across all iterations.
- Identify split ratios whose percentage of matching distributions is both high and stable (low variation) across iterations.
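The aggregation step can be sketched as follows: for each test size, repeat the split, compute the percentage of variables that pass the KS check in each iteration, and then summarize across iterations. The function name `aggregate_by_ratio`, the summary keys, and the seeding scheme are assumptions for illustration; the count of iterations in which every variable matched is included because the reporting step asks for it.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split


def aggregate_by_ratio(X, test_sizes, n_iterations=20, alpha=0.05, seed=0):
    """Summarize, per test size, the share of variables that pass the KS check."""
    summary = {}
    for test_size in test_sizes:
        pct_per_iter = []
        for it in range(n_iterations):
            # A different seed per iteration gives a different random split
            X_train, X_test = train_test_split(
                X, test_size=test_size, random_state=seed + it
            )
            passed = [
                ks_2samp(X_train[:, j], X_test[:, j]).pvalue > alpha
                for j in range(X.shape[1])
            ]
            pct_per_iter.append(100.0 * np.mean(passed))
        pct_per_iter = np.array(pct_per_iter)
        summary[test_size] = {
            "mean_pct": float(pct_per_iter.mean()),   # average across iterations
            "std_pct": float(pct_per_iter.std()),     # stability across iterations
            "n_all_match": int(np.sum(pct_per_iter == 100.0)),
        }
    return summary


# Toy data: 400 rows, 5 numerical features (assumed for illustration)
X_demo = np.random.default_rng(0).normal(size=(400, 5))
summary = aggregate_by_ratio(X_demo, test_sizes=[0.5, 0.2], n_iterations=10)
for test_size, stats in summary.items():
    print(test_size, stats)
```

A low `std_pct` together with a high `mean_pct` is one possible reading of "consistently stable" for a given ratio.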
Report Findings
- Summarize the distribution comparison results for each split ratio, including:
  - Average percentage of variables with similar distributions.
  - Number of iterations where all variables had matching distributions.
- Highlight the split ratio(s) with the best performance based on these metrics.
Code Example
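Below is a Python implementation of the diagnostic that evaluates train/test split ratios by comparing the distributions of numerical variables between the train and test sets. For each variable, it reports the percentage of random splits in which a two-sample Kolmogorov-Smirnov test finds no significant difference at level alpha.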
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split


def split_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05):
    """
    Perform sensitivity analysis of train/test split sizes by comparing variable distributions.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array (split alongside X; not used in the comparison).
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for statistical tests.

    Returns:
        dict: Maps each split ratio to a dict giving, per variable, the percentage
            of trials in which the KS test did not reject equality (p > alpha).
    """
    n_features = X.shape[1]
    results = {}
    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = {f"Variable_{i}": [] for i in range(n_features)}
        for _ in range(num_trials):
            # Unseeded random split, so exact results vary from run to run
            X_train, X_test, _, _ = train_test_split(X, y, train_size=train_ratio, test_size=test_ratio)
            for i in range(n_features):
                # Two-sample KS test on feature i; only the p-value is used
                stat, p_value = ks_2samp(X_train[:, i], X_test[:, i])
                results[ratio_key][f"Variable_{i}"].append(p_value > alpha)
        for var in results[ratio_key]:
            results[ratio_key][var] = np.mean(results[ratio_key][var]) * 100  # Percentage of consistent trials
    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]
    # Run the sensitivity analysis
    results = split_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05)
    # Print Results
    for split, variables in results.items():
        print(f"Split Ratio {split}:")
        for var, percentage in variables.items():
            print(f" - {var}: {percentage:.2f}% consistent splits")
        print("-" * 40)
Example Output
Split Ratio Train:0.5, Test:0.5:
- Variable_0: 90.00% consistent splits
- Variable_1: 93.33% consistent splits
- Variable_2: 93.33% consistent splits
- Variable_3: 93.33% consistent splits
- Variable_4: 96.67% consistent splits
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- Variable_0: 100.00% consistent splits
- Variable_1: 100.00% consistent splits
- Variable_2: 96.67% consistent splits
- Variable_3: 100.00% consistent splits
- Variable_4: 96.67% consistent splits
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- Variable_0: 96.67% consistent splits
- Variable_1: 96.67% consistent splits
- Variable_2: 96.67% consistent splits
- Variable_3: 90.00% consistent splits
- Variable_4: 90.00% consistent splits
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
- Variable_0: 93.33% consistent splits
- Variable_1: 96.67% consistent splits
- Variable_2: 90.00% consistent splits
- Variable_3: 93.33% consistent splits
- Variable_4: 96.67% consistent splits
----------------------------------------
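As a reading aid for the output above, the sketch below loads the printed percentages into a pandas table, averages them per split ratio, and estimates how much a single percentage could vary by chance with 30 trials. The short ratio labels and the assumed true pass rate of 1 - alpha = 0.95 are assumptions of this example, not results from the analysis.

```python
import math
import pandas as pd

# Per-variable percentages copied from the example output above
example_output = {
    "0.5/0.5": [90.00, 93.33, 93.33, 93.33, 96.67],
    "0.7/0.3": [100.00, 100.00, 96.67, 100.00, 96.67],
    "0.8/0.2": [96.67, 96.67, 96.67, 90.00, 90.00],
    "0.9/0.1": [93.33, 96.67, 90.00, 93.33, 96.67],
}

# Rows are split ratios, columns are variables
table = pd.DataFrame(example_output, index=[f"Variable_{i}" for i in range(5)]).T
table["mean_pct"] = table.mean(axis=1)
print(table.round(2))
print("Highest average:", table["mean_pct"].idxmax())

# Rough noise scale for one entry: binomial standard error with 30 trials
# and an ASSUMED true pass rate of 0.95 (i.e., 1 - alpha)
num_trials, pass_rate = 30, 0.95
std_err_pct = 100 * math.sqrt(pass_rate * (1 - pass_rate) / num_trials)
print(f"Approx. standard error per entry: {std_err_pct:.1f} percentage points")
```

With 30 trials, each trial is worth about 3.33 percentage points, so under this assumption differences of a few points between ratios may be within random variation. Because the demo splits are unseeded, rerunning it will generally print different values.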