Split Sensitivity Analysis By Performance Distribution
Procedure
Define Split Ratios and Metric
- Identify train/test split ratios to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
- Select a performance metric (e.g., accuracy, precision, recall, or F1 score) for evaluation; a short sketch of this setup follows this list.
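As a minimal sketch of this setup (assuming a scikit-learn workflow and accuracy as the metric; the names split_ratios and metric are illustrative and reused in the later sketches):

from sklearn.metrics import accuracy_score

# Candidate (train fraction, test fraction) pairs to evaluate
split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

# Evaluation metric with the signature metric(y_true, y_pred)
metric = accuracy_score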
Perform Model Evaluation with K-Fold Cross-Validation
- For each split ratio, divide the train set using k-fold cross-validation (e.g., k=5).
- Train a standard model, such as logistic regression, on the training folds and evaluate on the validation fold.
- Record the performance metric for each fold (see the cross-validation sketch below).
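A minimal sketch of this step, assuming synthetic data, a logistic regression model, and a 70/30 split (all illustrative choices, not requirements):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score

# Stand-in data and one split ratio under evaluation
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)

# 5-fold cross-validation on the train portion; one score per fold
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train,
                            cv=5, scoring=make_scorer(accuracy_score))
print(cv_scores)  # distribution of fold-level accuracies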
Evaluate Test Set Performance
- Train the model on the entire train set for each split ratio.
- Test the model on the reserved test set and record the performance metric, as in the sketch after this list.
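A matching sketch for the held-out evaluation, again assuming the same synthetic data and an 80/20 split purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

# Train on the entire train portion, then record a single score on the reserved test set
test_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
print(f"Held-out accuracy: {test_score:.3f}")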
Compare Performance Distributions
- Use a statistical test (e.g., the Kolmogorov-Smirnov test or the Anderson-Darling test) to compare the distribution of k-fold cross-validation scores with the test set score for each split ratio, as sketched below.
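A minimal sketch of the comparison itself, using hypothetical fold scores and a hypothetical test score (scipy.stats.anderson_ksamp offers an Anderson-Darling k-sample alternative):

from scipy.stats import ks_2samp

cv_scores = [0.84, 0.86, 0.83, 0.87, 0.85]  # hypothetical k-fold scores
test_score = 0.85                           # hypothetical held-out score

# Two-sample KS test between the fold-score distribution and the single test score
stat, p_value = ks_2samp(cv_scores, [test_score])
equivalent = p_value > 0.05  # treat the distributions as consistent when p > alpha
print(stat, p_value, equivalent)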
Analyze Consistency and Stability
- Count the number of cases where the train (k-fold) and test set distributions are statistically equivalent (e.g., p-value > 0.05).
- Repeat the entire process multiple times (e.g., 10–20 trials per split ratio) to account for variability; a condensed trial loop is sketched after this list.
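A condensed sketch of the trial loop for a single split ratio, using the same illustrative data, model, and test as above; the full function in the Code Example below generalizes this over all ratios:

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)  # stand-in data
flags = []
for _ in range(20):  # e.g., 20 trials for the 80/20 ratio
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, test_size=0.2)
    model = LogisticRegression(max_iter=1000)
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)  # default scoring = accuracy for classifiers
    test_score = model.fit(X_tr, y_tr).score(X_te, y_te)
    _, p_value = ks_2samp(cv_scores, [test_score])
    flags.append(p_value > 0.05)  # True when the two distributions are statistically equivalent
print(f"Consistent trials: {sum(flags)} of {len(flags)}")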
Aggregate and Summarize Results
- Calculate the percentage of trials where the train and test set distributions are statistically equivalent for each split ratio.
- Identify the split ratio(s) whose k-fold and test set scores are most consistently equivalent across trials (a small aggregation sketch follows).
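Aggregation then reduces to the fraction of consistent trials per ratio; a tiny sketch with hypothetical flags:

import numpy as np

# Hypothetical per-trial consistency flags (True = k-fold and test distributions equivalent)
trials = {"Train:0.8, Test:0.2": [True, True, False, True, True],
          "Train:0.9, Test:0.1": [True, False, False, True, True]}
summary = {ratio: np.mean(flags) * 100 for ratio, flags in trials.items()}
print(summary)  # percentage of consistent trials per split ratio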
Report Findings
- Summarize the performance comparison results for each split ratio, including:
  - Average percentage of trials with statistically equivalent distributions.
  - Number of trials where the k-fold and test set score distributions agree.
- Highlight the split ratio(s) that provide stable and reliable performance evaluation.
Code Example
Below is a Python function that performs a diagnostic test to evaluate train/test split ratios by comparing model performance distributions using k-fold cross-validation and a chosen evaluation metric.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import ks_2samp


def model_split_sensitivity_analysis(X, y, split_ratios, model, metric, k=10, num_trials=30, alpha=0.05):
    """
    Perform sensitivity analysis of train/test split sizes by comparing model performance distributions.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        model: A scikit-learn compatible model instance.
        metric: A scikit-learn compatible scoring function (e.g., accuracy_score).
        k (int): Number of folds for cross-validation.
        num_trials (int): Number of random splits to test for each ratio.
        alpha (float): Significance level for the Kolmogorov-Smirnov test.

    Returns:
        dict: A dictionary keyed by split ratio with the percentage of consistent trials.
    """
    scorer = make_scorer(metric)
    results = {}

    for train_ratio, test_ratio in split_ratios:
        ratio_key = f"Train:{train_ratio}, Test:{test_ratio}"
        results[ratio_key] = []

        for _ in range(num_trials):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=train_ratio, test_size=test_ratio
            )

            # Cross-validation performance on the train portion
            cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=scorer)

            # Test set performance after refitting on the full train portion
            model.fit(X_train, y_train)
            test_score = metric(y_test, model.predict(X_test))

            # Distribution comparison between fold scores and the test score
            stat, p_value = ks_2samp(cv_scores, [test_score])
            results[ratio_key].append(p_value > alpha)  # consistent if p-value > alpha

        # Summarize results for the split ratio
        results[ratio_key] = {
            "Consistent Trials (%)": np.mean(results[ratio_key]) * 100
        }

    return results
# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Initialize model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the sensitivity analysis
    results = model_split_sensitivity_analysis(X, y, split_ratios, model, metric, k=10, num_trials=30, alpha=0.01)

    # Print Results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        for key, value in details.items():
            print(f"  - {key}: {value:.2f}%")
        print("-" * 40)
Example Output
Split Ratio Train:0.5, Test:0.5:
- Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- Consistent Trials (%): 100.00%
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
- Consistent Trials (%): 100.00%
----------------------------------------