Split Sensitivity Analysis By Performance Central Tendency
We can perform a sensitivity analysis of train/test split sizes using comparisons of performance distributions.
The goal is to evaluate the quality of different train/test split ratios based on the consistency of model performance metrics, ensuring that each split represents the dataset equally well from a modeling perspective.
Procedure:
- Define Split Ratios:
  - Specify a range of train/test split ratios to evaluate (e.g., 50/50, 70/30, 80/20, 90/10).
- Select a Model:
  - Use a standard baseline model (e.g., a dummy model, random forest, or logistic regression) to ensure consistent results.
  - Ensure the model can handle the dataset’s characteristics (classification/regression, numerical/categorical).
- Generate Random Splits:
  - For each split ratio, perform multiple random train/test splits to account for variability.
- Cross-Validation on Subsets:
  - Perform k-fold cross-validation (CV) on the training set of each split to obtain a distribution of performance metrics (e.g., accuracy, RMSE, F1-score).
  - Independently, perform k-fold CV on the test set (treating it as a surrogate for unseen data) to obtain a comparable performance distribution.
- Compare Performance Metric Distributions:
  - For each split ratio and each performance metric, use a statistical test to compare the central tendencies of the train and test CV performance distributions (see the sketch after this list):
    - Parametric test: t-test if the performance metric distribution is normal.
    - Non-parametric test: Mann-Whitney U test if the metric distribution is non-normal.
  - Record whether the train and test performance distributions are statistically similar for each metric.
- Aggregate Results:
  - For each split ratio, compute:
    - The percentage of splits where train and test distributions are statistically similar.
    - The percentage similarity for each performance metric across splits.
- Evaluate Split Quality:
  - “Good” splits yield train and test CV performance distributions that are consistently similar across metrics and splits.
  - “Bad” splits show significant differences between the train and test performance distributions, indicating that the split is not representative.
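The test-selection logic in the comparison step can be made explicit. Below is a minimal sketch, assuming SciPy is available: the hypothetical helper compare_central_tendency checks both CV score samples for normality with a Shapiro-Wilk test and applies Welch's t-test when both look normal, otherwise the Mann-Whitney U test. The full code example in the next section uses the Mann-Whitney U test throughout.

from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def compare_central_tendency(train_scores, test_scores, alpha=0.05):
    """
    Hypothetical helper: decide whether two CV score samples share a central tendency.
    Uses Welch's t-test when both samples appear normal (Shapiro-Wilk), otherwise the
    Mann-Whitney U test. Requires at least 3 scores per sample for the Shapiro-Wilk test.
    """
    _, p_train_normal = shapiro(train_scores)
    _, p_test_normal = shapiro(test_scores)
    if p_train_normal > alpha and p_test_normal > alpha:
        # Both samples look approximately normal: parametric comparison
        _, p_value = ttest_ind(train_scores, test_scores, equal_var=False)
    else:
        # Otherwise fall back to the non-parametric comparison
        _, p_value = mannwhitneyu(train_scores, test_scores, alternative='two-sided')
    # Distributions are treated as statistically similar when p > alpha
    return p_value > alpha, p_value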
Code Example
Below is a code example that performs the split sensitivity analysis based on model performance metrics.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import mannwhitneyu


def split_ratio_sensitivity_test(X, y, split_ratios, model, metric, k=5, num_trials=20, alpha=0.05):
    """
    Evaluates the impact of train/test split ratios on model performance by applying k-fold cross-validation
    to both train and test sets and comparing their central tendencies using a non-parametric test.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): Train/test split ratios (e.g., [(0.7, 0.3), (0.8, 0.2)]).
        model: A scikit-learn compatible model instance.
        metric: A callable metric function (e.g., sklearn.metrics.accuracy_score).
        k (int): Number of folds for cross-validation.
        num_trials (int): Number of random trials for each split ratio.
        alpha (float): Significance level for the statistical test (default is 0.05).

    Returns:
        dict: A dictionary summarizing the percentage of trials with equivalent central tendencies for each split ratio.
    """
    results = {}
    scorer = make_scorer(metric)

    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = []

        for _ in range(num_trials):
            # Split the data
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=train_ratio, test_size=test_ratio, random_state=None
            )

            # Perform k-fold cross-validation on the train set
            train_cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=scorer)

            # Perform k-fold cross-validation on the test set
            test_cv_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=scorer)

            # Compare central tendencies using Mann-Whitney U test
            _, p_value = mannwhitneyu(train_cv_scores, test_cv_scores, alternative='two-sided')

            # Evaluate consistency based on p-value
            results[ratio_key].append(p_value > alpha)  # Consistent if p-value > alpha

        # Summarize results for this split ratio
        results[ratio_key] = {
            "Consistent Trials (%)": np.mean(results[ratio_key]) * 100
        }

    return results


# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Generate synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Initialize model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the sensitivity analysis
    results = split_ratio_sensitivity_test(
        X, y, split_ratios, model, metric, k=10, num_trials=30, alpha=0.01
    )

    # Print results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        for key, value in details.items():
            print(f" - {key}: {value:.2f}%")
        print("-" * 40)
Example Output:
Split Ratio Train:0.5, Test:0.5:
- Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
- Consistent Trials (%): 90.00%
----------------------------------------
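The procedure is not limited to classification metrics such as accuracy; the same function can also be applied to regression tasks. Below is a hedged sketch, assuming split_ratio_sensitivity_test from above is in scope and using Ridge and mean_squared_error purely as illustrative stand-ins for the regression metrics (e.g., RMSE) mentioned in the procedure. Because make_scorer is called with its default greater_is_better=True, the raw MSE values are compared directly; this is acceptable here since the Mann-Whitney U test only compares the two score distributions rather than maximizing the metric.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression dataset (illustrative only)
X_reg, y_reg = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Reuse the same sensitivity test with a regression model and an error metric
reg_results = split_ratio_sensitivity_test(
    X_reg, y_reg, [(0.7, 0.3), (0.9, 0.1)], Ridge(), mean_squared_error,
    k=5, num_trials=20, alpha=0.05
)

for split, details in reg_results.items():
    print(f"Split Ratio {split}: {details}")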