Split Sensitivity Analysis By Data Variance
Procedure
Define Split Ratios
- Select split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
- Determine the number of trials to perform for each split ratio (e.g., 10–20 iterations per ratio).
Generate Random Splits
- For each split ratio, create random train/test splits multiple times to simulate variability and assess stability.
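One way to generate the repeated random splits reproducibly is sketched below. It assumes scikit-learn's `ShuffleSplit`; the helper name `generate_splits` and the `seed` parameter are illustrative and not part of the procedure above.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit


def generate_splits(n_samples, train_ratio, num_trials=10, seed=0):
    """Yield (train_idx, test_idx) index arrays for repeated random splits.

    Hypothetical helper: the test set is the complement of the train set,
    so only the train ratio is passed to ShuffleSplit.
    """
    splitter = ShuffleSplit(
        n_splits=num_trials,
        train_size=train_ratio,
        random_state=seed,  # fixed seed so the same splits can be regenerated
    )
    # ShuffleSplit only needs the number of rows, so a placeholder array suffices.
    yield from splitter.split(np.zeros((n_samples, 1)))


# Usage: four 50/50 splits of a 100-row dataset
for train_idx, test_idx in generate_splits(100, 0.5, num_trials=4, seed=1):
    print(len(train_idx), len(test_idx))
```

Fixing the seed is a design choice: it keeps the analysis repeatable while still sampling a different split in every trial.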
Perform Variance Tests
- For each split and trial, use a variance comparison test such as the F-test or Levene’s test:
  - Test each input/target variable between the train and test datasets.
  - Record whether the variances are statistically similar (e.g., p-value > 0.05, meaning the test does not reject equal variances).
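The two tests named above can be sketched as follows. SciPy provides `levene` directly; it has no dedicated two-sample F-test function, so the version here is built from the F distribution and is only a sketch. The F-test assumes roughly normal data, while Levene's test is more robust to non-normality. The function names `f_test_equal_variance` and `variances_similar` are illustrative.

```python
import numpy as np
from scipy import stats


def f_test_equal_variance(a, b):
    """Two-sided F-test for equal variances (assumes roughly normal samples)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
    dfn, dfd = a.size - 1, b.size - 1
    # Two-sided p-value: double the smaller tail probability, capped at 1.
    tail = min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))
    return f_stat, min(1.0, 2.0 * tail)


def variances_similar(a, b, alpha=0.05, method="levene"):
    """Return True when the chosen test does not reject equal variances."""
    if method == "levene":
        _, p_value = stats.levene(a, b, center="median")
    elif method == "f":
        _, p_value = f_test_equal_variance(a, b)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return bool(p_value > alpha)
```

Note that a p-value above `alpha` only means the test found no evidence of a difference; it is treated as "similar" here because that is the convention the procedure uses.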
Calculate Consistency Metrics
- For each split ratio and across all trials, calculate:
  - The percentage of variables with similar variances in the train and test datasets.
  - The number of trials in which all variables have similar variances.
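Both metrics can be computed from a boolean matrix with one row per trial and one column per variable, where `True` means the test did not reject equal variances. That layout and the name `consistency_metrics` are assumptions made for this sketch.

```python
import numpy as np


def consistency_metrics(similar):
    """Compute per-trial consistency metrics.

    similar: boolean array-like of shape (num_trials, n_variables).
    Returns the percentage of similar variables per trial and a boolean
    flag per trial that is True when every variable was similar.
    """
    similar = np.asarray(similar, dtype=bool)
    pct_similar_per_trial = similar.mean(axis=1) * 100.0
    all_match_per_trial = similar.all(axis=1)
    return pct_similar_per_trial, all_match_per_trial


# Usage with three trials and four variables (illustrative values only)
pct, all_match = consistency_metrics([
    [True, True, True, True],
    [True, False, True, True],
    [False, False, True, True],
])
print(pct)        # percentage of similar variables in each trial
print(all_match)  # whether every variable matched in each trial
```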
Aggregate Results Across Splits
- Summarize results for each split ratio by:
  - Averaging the percentage of variables with similar variances across trials.
  - Counting the number of trials in which all variables have similar variances.
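The aggregation step might look like the sketch below. It assumes the per-trial results for each split ratio are stored as a boolean matrix (trials by variables) in a dictionary keyed by a ratio label; the function name and the output keys are hypothetical.

```python
import numpy as np


def summarize_by_ratio(similar_by_ratio):
    """Summarize variance similarity for each split ratio.

    similar_by_ratio: dict mapping a ratio label to a boolean array-like
    of shape (num_trials, n_variables).
    """
    summary = {}
    for ratio, similar in similar_by_ratio.items():
        similar = np.asarray(similar, dtype=bool)
        summary[ratio] = {
            # Average over trials of the per-trial percentage of similar variables.
            "mean_pct_similar": float(similar.mean(axis=1).mean() * 100.0),
            # Number of trials in which every variable was similar.
            "all_match_trials": int(similar.all(axis=1).sum()),
            "num_trials": int(similar.shape[0]),
        }
    return summary
```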
Report Findings
- Present findings in a clear tabular or graphical format:
  - Highlight split ratios with consistently high variance similarity across variables.
  - Identify any split ratios where few or no variables maintain similar variances.
  - Provide recommendations for optimal split ratios based on observed trends and stability.
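For the tabular option, a plain-text report could be produced along these lines. The input layout (a dict of per-ratio summaries with `mean_pct_similar`, `all_match_trials`, and `num_trials` keys) is an assumption of this sketch, and ranking by mean similarity is just one possible heuristic for highlighting stable ratios.

```python
def format_report(summary):
    """Render per-ratio summaries as a text table, most consistent ratio first."""
    header = f"{'Split':<10}{'Mean % similar':>16}{'All-match trials':>20}"
    lines = [header, "-" * len(header)]
    ranked = sorted(
        summary.items(),
        key=lambda item: item[1]["mean_pct_similar"],
        reverse=True,
    )
    for ratio, row in ranked:
        trials = f"{row['all_match_trials']}/{row['num_trials']}"
        lines.append(f"{ratio:<10}{row['mean_pct_similar']:>16.2f}{trials:>20}")
    return "\n".join(lines)


# Usage with placeholder numbers (made up for illustration, not measured results)
example_summary = {
    "70/30": {"mean_pct_similar": 92.0, "all_match_trials": 6, "num_trials": 10},
    "80/20": {"mean_pct_similar": 96.5, "all_match_trials": 8, "num_trials": 10},
}
print(format_report(example_summary))
```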
Code Example
Below is a Python function that evaluates train/test split ratios by comparing the variance consistency of numerical variables across splits.
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.stats import levene


def split_variance_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05):
    """
    Perform variance sensitivity analysis for train/test splits across numerical variables.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array (split alongside X, but its variance is not tested here).
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.7, 0.3), (0.8, 0.2)]).
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for Levene’s test.

    Returns:
        dict: Maps each split-ratio label to a dict giving, for each variable,
            the percentage of trials in which Levene’s test did not reject equal variances.
    """
    n_features = X.shape[1]
    results = {}
    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = {f"Variable_{i}": [] for i in range(n_features)}
        for _trial in range(num_trials):
            # Unseeded random split: every trial draws a different partition
            X_train, X_test, _y_train, _y_test = train_test_split(
                X, y, train_size=train_ratio, test_size=test_ratio
            )
            for i in range(n_features):
                # Compare the variance of feature i between the train and test sets
                _stat, p_value = levene(X_train[:, i], X_test[:, i])
                results[ratio_key][f"Variable_{i}"].append(p_value > alpha)
        for var in results[ratio_key]:
            # Percentage of consistent splits
            results[ratio_key][var] = np.mean(results[ratio_key][var]) * 100
    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Run the sensitivity analysis
    results = split_variance_sensitivity_analysis(X, y, split_ratios, num_trials=30, alpha=0.05)

    # Print Results
    for split, variables in results.items():
        print(f"Split Ratio {split}:")
        for var, percentage in variables.items():
            print(f" - {var}: {percentage:.2f}% consistent splits")
        print("-" * 40)
Example Output (exact percentages vary between runs because the splits are not seeded)
Split Ratio Train:0.5, Test:0.5:
- Variable_0: 86.67% consistent splits
- Variable_1: 100.00% consistent splits
- Variable_2: 96.67% consistent splits
- Variable_3: 93.33% consistent splits
- Variable_4: 100.00% consistent splits
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- Variable_0: 96.67% consistent splits
- Variable_1: 96.67% consistent splits
- Variable_2: 90.00% consistent splits
- Variable_3: 93.33% consistent splits
- Variable_4: 93.33% consistent splits
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- Variable_0: 93.33% consistent splits
- Variable_1: 96.67% consistent splits
- Variable_2: 96.67% consistent splits
- Variable_3: 100.00% consistent splits
- Variable_4: 86.67% consistent splits
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
- Variable_0: 96.67% consistent splits
- Variable_1: 100.00% consistent splits
- Variable_2: 96.67% consistent splits
- Variable_3: 96.67% consistent splits
- Variable_4: 96.67% consistent splits
----------------------------------------