Split Sensitivity Analysis By Data Correlation
Procedure
- Define Split Ratios:
  - Select several train/test split ratios to evaluate, such as 70/30, 80/20, and 90/10.
- Random Splitting:
  - For each split ratio, generate multiple random train/test splits so that the results are robust to any single split.
- Correlation Computation:
  - Compute correlation coefficients (e.g., Pearson's for linear relationships or Spearman's for monotonic relationships) for the numerical variables in the entire dataset. For example:
    - the correlation between each feature and the numerical target variable,
    - the correlation between each variable and every other variable within each set, and
    - the correlation of each variable in the train set against the same variable in the test set (note that this requires the two samples to be the same size).
- Correlation Comparison:
  - For each variable pair, compare the correlations between the train and test splits using a defined similarity threshold (e.g., a difference of less than 0.05 between the correlation values); a sketch of this pairwise comparison follows this procedure.
  - Calculate the percentage of variable pairs whose correlation difference stays within this threshold.
- Assessment of Split Quality:
  - Evaluate each split ratio based on the consistency of its correlations:
    - Identify the percentage of variable pairs whose correlation values remain similar across the train and test splits.
    - Summarize the results across all trials for each split ratio to determine the most consistent split.
- Reporting Results:
  - Report the split ratio with the highest percentage of consistent correlations across trials as the optimal split.
  - Include visualizations, if helpful, to demonstrate correlation consistency across splits (a summary sketch appears after the example output below).
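The code example in the next section evaluates only the feature-target correlations from step 3. As a complementary, minimal sketch of the pairwise comparison described in steps 3 and 4, the hypothetical helper below builds the full correlation matrix on a train and a test partition and reports the percentage of variable pairs whose correlations agree within the threshold. The function name pairwise_correlation_consistency and the use of pandas' DataFrame.corr are illustrative choices, not part of the procedure above.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def pairwise_correlation_consistency(df, train_size=0.8, threshold=0.05, method="pearson"):
    """Hypothetical helper: percentage of variable pairs whose train/test correlations agree."""
    # Split the rows of the (numerical) DataFrame into train and test partitions
    train_df, test_df = train_test_split(df, train_size=train_size)

    # Full pairwise correlation matrices for each partition (method: "pearson" or "spearman")
    train_corr = train_df.corr(method=method)
    test_corr = test_df.corr(method=method)

    # Absolute differences between the two matrices; keep the upper triangle only,
    # so each variable pair is counted once and the diagonal is excluded
    diff = (train_corr - test_corr).abs().to_numpy()
    pair_diffs = diff[np.triu_indices_from(diff, k=1)]

    # Percentage of variable pairs within the similarity threshold (step 4)
    return float(np.mean(pair_diffs <= threshold) * 100)

Averaging this percentage over several random splits, as in step 2, gives the per-ratio consistency figure used in step 5.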
Code Example
Below is a Python function that evaluates train/test split ratios based on how consistently each numerical input variable's correlation with the target variable is preserved across splits.
import numpy as np
from scipy.stats import pearsonr, spearmanr, shapiro
from sklearn.model_selection import train_test_split


def correlation_sensitivity_analysis(X, y, split_ratios, num_trials=30, correlation_threshold=0.05, alpha=0.05):
    """
    Perform sensitivity analysis on train/test split ratios to evaluate the
    consistency of numerical variable correlations.

    Parameters:
        X (np.ndarray): Feature matrix.
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        num_trials (int): Number of trials for each ratio.
        correlation_threshold (float): Threshold for acceptable correlation differences.
        alpha (float): Significance level for normality tests.

    Returns:
        dict: Summary of correlation consistency results for each split ratio and each variable.
    """
    results = {}
    num_features = X.shape[1]

    # Perform a normality test for each feature in the dataset
    normality = {}
    for i in range(num_features):
        p_value = shapiro(X[:, i])[1]
        normality[i] = p_value > alpha  # True if the variable is normally distributed

    for ratio in split_ratios:
        train_size, test_size = ratio
        ratio_results = {}
        for feature_idx in range(num_features):
            feature_consistency = []
            for _ in range(num_trials):
                # Perform a random split
                X_train, X_test, y_train, y_test = train_test_split(
                    X, y, train_size=train_size, test_size=test_size
                )
                # Choose the appropriate correlation function:
                # Pearson for (approximately) normal features, Spearman otherwise
                if normality[feature_idx]:
                    corr_func = pearsonr
                else:
                    corr_func = spearmanr
                # Compute the feature-target correlation on each split
                train_corr, _ = corr_func(X_train[:, feature_idx], y_train)
                test_corr, _ = corr_func(X_test[:, feature_idx], y_test)
                # Check whether the two correlations agree within the threshold
                consistent = abs(train_corr - test_corr) <= correlation_threshold
                feature_consistency.append(consistent)
            # Calculate the percentage of consistent trials for the feature
            ratio_results[f"Feature_{feature_idx}"] = {
                "Normal": normality[feature_idx],
                "Consistency (%)": np.mean(feature_consistency) * 100,
            }
        # Store results for the split ratio
        results[ratio] = ratio_results
    return results


# Example usage of the function
if __name__ == "__main__":
    from sklearn.datasets import make_regression

    # Generate a synthetic regression dataset
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Define split ratios to test
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Perform the sensitivity analysis
    results = correlation_sensitivity_analysis(
        X, y, split_ratios, num_trials=30, correlation_threshold=0.05, alpha=0.05
    )

    # Display the results
    for ratio, features in results.items():
        print(f"Split Ratio {ratio}:")
        for feature, metrics in features.items():
            print(
                f"  - {feature}: (normal: {metrics['Normal']}) {metrics['Consistency (%)']:.2f}% consistent trials"
            )
        print("-" * 40)
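The function selects Pearson or Spearman per feature using a Shapiro-Wilk normality test on the full dataset. Because make_regression draws its features from a standard normal distribution, the test accepts normality for every feature in this example, which is why the output below reports normal: True throughout.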
Example Output
Split Ratio (0.5, 0.5):
- Feature_0: (normal: True) 60.00% consistent trials
- Feature_1: (normal: True) 90.00% consistent trials
- Feature_2: (normal: True) 66.67% consistent trials
- Feature_3: (normal: True) 63.33% consistent trials
- Feature_4: (normal: True) 66.67% consistent trials
----------------------------------------
Split Ratio (0.7, 0.3):
- Feature_0: (normal: True) 50.00% consistent trials
- Feature_1: (normal: True) 76.67% consistent trials
- Feature_2: (normal: True) 56.67% consistent trials
- Feature_3: (normal: True) 46.67% consistent trials
- Feature_4: (normal: True) 53.33% consistent trials
----------------------------------------
Split Ratio (0.8, 0.2):
- Feature_0: (normal: True) 60.00% consistent trials
- Feature_1: (normal: True) 86.67% consistent trials
- Feature_2: (normal: True) 66.67% consistent trials
- Feature_3: (normal: True) 50.00% consistent trials
- Feature_4: (normal: True) 53.33% consistent trials
----------------------------------------
Split Ratio (0.9, 0.1):
- Feature_0: (normal: True) 40.00% consistent trials
- Feature_1: (normal: True) 80.00% consistent trials
- Feature_2: (normal: True) 40.00% consistent trials
- Feature_3: (normal: True) 30.00% consistent trials
- Feature_4: (normal: True) 46.67% consistent trials
----------------------------------------
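To carry out step 6 (Reporting Results), the results dictionary returned above can be reduced to a single mean consistency figure per split ratio and, optionally, visualized. The sketch below assumes the results variable produced by the example usage; the helper name summarize_split_consistency and the matplotlib bar chart are illustrative choices, not a prescribed reporting format.

import numpy as np
import matplotlib.pyplot as plt


def summarize_split_consistency(results):
    """Hypothetical helper: mean per-feature consistency for each split ratio and the best ratio."""
    summary = {
        ratio: np.mean([m["Consistency (%)"] for m in features.values()])
        for ratio, features in results.items()
    }
    # The ratio with the highest mean consistency is reported as the optimal split
    best_ratio = max(summary, key=summary.get)
    return summary, best_ratio


# Assumes `results` from the example usage above
summary, best_ratio = summarize_split_consistency(results)
print(f"Most consistent split ratio: {best_ratio} ({summary[best_ratio]:.2f}% mean consistency)")

# Optional visualization: mean consistency per split ratio
labels = [f"{round(tr * 100)}/{round(te * 100)}" for tr, te in summary]
plt.bar(labels, list(summary.values()))
plt.xlabel("Train/test split ratio")
plt.ylabel("Mean consistency across features (%)")
plt.title("Correlation consistency by split ratio")
plt.show()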