Split Sensitivity Analysis By Performance Divergence
Procedure
-
Define Split Sizes and Divergence Metric
- Identify train/test split sizes to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
- Select a divergence metric, such as KL-divergence or JS-divergence, to measure the difference between score distributions.
- Normalize each score distribution (for example, a histogram of scores) so that it sums to 1 before computing the divergence.
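For reference, the two divergences named above have the following standard definitions for discrete distributions over the same set of bins. The symbols P, Q, M and the bin index i are notation introduced here for illustration; P and Q are each assumed to sum to 1.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)},
\qquad
D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M),
\quad M = \tfrac{1}{2}(P + Q).
```

KL-divergence is asymmetric and becomes infinite when Q(i) = 0 for a bin where P(i) > 0. JS-divergence is symmetric and, with the natural logarithm, bounded above by log 2, which makes a fixed threshold easier to interpret.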
-
Perform Model Evaluation with K-Fold Cross-Validation
- For each split size, divide the training set using k-fold cross-validation (e.g., k=5).
- Train a naive or standard model, such as logistic regression, on the training folds and evaluate it on the validation fold.
- Record the model’s performance scores for each fold.
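The cross-validation step above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the synthetic dataset, the 80/20 split, the helper name kfold_scores, and the use of accuracy as the score are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def kfold_scores(X_train, y_train, k=5, seed=0):
    """Return one validation accuracy per fold (hypothetical helper)."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for fit_idx, val_idx in folds.split(X_train, y_train):
        # Fit on k-1 folds, score on the held-out fold.
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[fit_idx], y_train[fit_idx])
        scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    return np.array(scores)


cv_scores = kfold_scores(X_train, y_train)
print(cv_scores.round(3))
```

The test set is split off first and never touched inside the loop, so the fold scores depend only on the training portion.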
-
Evaluate Test Set Performance
- Train the model on the entire training set for each split size.
- Test the model on the reserved test set and record the performance scores.
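A single fit on the full training set produces one test score, whereas the later divergence step compares distributions. The procedure does not say how the test-set distribution is obtained; one possible approach, shown here purely as an assumption, is to bootstrap-resample the test set predictions. The helper name bootstrap_test_scores and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train once on the entire training set, then score the reserved test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
test_score = accuracy_score(y_test, y_pred)


def bootstrap_test_scores(y_true, y_pred, n_boot=200, seed=0):
    """Resample test rows with replacement to get a spread of scores."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores[b] = accuracy_score(y_true[idx], y_pred[idx])
    return scores


test_scores = bootstrap_test_scores(y_test, y_pred)
print(f"test accuracy: {test_score:.3f}")
print(f"bootstrap mean: {test_scores.mean():.3f}, std: {test_scores.std():.3f}")
```

Bootstrapping reuses the already-computed predictions, so the model is trained only once per split.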
-
Calculate Divergence Scores
- Compute the divergence between the k-fold cross-validation performance score distribution and the test set performance score distribution for each split size.
- Record divergence scores for each split size.
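The divergence calculation can be sketched as below. The score arrays are made-up values, and the helper names score_distribution and js_divergence, the 20-bin histogram on [0, 1], and the smoothing constant are assumptions for illustration. scipy.stats.entropy(p, q) returns the KL-divergence of p from q.

```python
import numpy as np
from scipy.stats import entropy


def score_distribution(scores, bins=20, eps=1e-10):
    """Histogram scores on [0, 1] and normalize the bins to probabilities."""
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    probs = counts.astype(float) + eps  # smoothing keeps empty bins away from log(0)
    return probs / probs.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)


cv_scores = np.array([0.81, 0.84, 0.79, 0.86, 0.82])    # made-up values
test_scores = np.array([0.80, 0.83, 0.85, 0.78, 0.82])  # made-up values

p = score_distribution(cv_scores)
q = score_distribution(test_scores)

kl = entropy(p, q)        # asymmetric: entropy(q, p) generally differs
js = js_divergence(p, q)  # symmetric
print(f"KL: {kl:.4f}  JS: {js:.4f}")
```

With only a handful of scores per distribution, most of the 20 bins are empty, so the result is sensitive to the bin count; fewer bins or more scores give a steadier estimate.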
-
Count Cases of Low Divergence
- Define a threshold for what constitutes “low divergence.”
- Count the number of cases for each split size where the divergence score is below the threshold.
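What counts as a sensible threshold depends on the scale of the chosen metric. As a rough guide, JS-divergence computed with the natural logarithm cannot exceed ln 2 (about 0.693), so a threshold can be expressed as a fraction of that maximum. The sketch below checks the bound on two non-overlapping distributions; the 15% fraction is an arbitrary value chosen for illustration.

```python
import numpy as np
from scipy.stats import entropy


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)


# Two distributions with no overlap reach the maximum possible JS value.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
js_max = js_divergence(p, q)

# Hypothetical threshold: 15% of the maximum (an arbitrary illustrative choice).
threshold = 0.15 * np.log(2)
print(f"JS maximum: {js_max:.4f}, threshold: {threshold:.4f}")
```

KL-divergence has no such upper bound, so a threshold for it has to be chosen empirically.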
-
Repeat and Aggregate Results
- Repeat the entire process multiple times (e.g., 10–20 trials per split size) to account for variability.
- Calculate the percentage of trials where each split size demonstrates low divergence.
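Counting and aggregating over trials can be sketched as follows, assuming the divergence scores from all trials have already been collected per split size. The function name summarize_low_divergence, the dictionary layout, and the divergence values are made up for illustration.

```python
import numpy as np


def summarize_low_divergence(divergences_by_split, threshold):
    """Count trials at or below the threshold and convert to a percentage."""
    summary = {}
    for split, values in divergences_by_split.items():
        values = np.asarray(values, dtype=float)
        low_count = int(np.sum(values <= threshold))
        summary[split] = {
            "trials": int(values.size),
            "low_count": low_count,
            "low_pct": 100.0 * low_count / values.size,
        }
    return summary


# Made-up divergence scores, four trials per split size.
divergences_by_split = {
    "80/20": [0.05, 0.12, 0.08, 0.20],
    "90/10": [0.30, 0.09, 0.25, 0.40],
}
summary = summarize_low_divergence(divergences_by_split, threshold=0.1)
print(summary)
```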
-
Analyze and Summarize Findings
- Compare the percentages of trials with low divergence across different split sizes.
- Identify split size(s) that consistently yield low divergence scores, indicating stable and reliable performance evaluation.
-
Report Findings
- Summarize the comparison of divergence scores for each split size, including:
  - Average percentage of trials with low divergence.
  - Total number of cases with low divergence across trials.
- Recommend the split size(s) that demonstrate the best balance of performance stability and distribution alignment.
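The reporting step can be sketched as a small ranking table. The function name format_report, the input layout, and the numbers are hypothetical; ranking purely by percentage of low-divergence trials is one simple way to pick a recommendation, not the only one.

```python
def format_report(summary):
    """Rank split sizes by percentage of low-divergence trials."""
    ranked = sorted(summary.items(), key=lambda kv: kv[1]["low_pct"], reverse=True)
    lines = [f"{'split':<8}{'low %':>8}{'low count':>12}"]
    for split, stats in ranked:
        lines.append(f"{split:<8}{stats['low_pct']:>8.1f}{stats['low_count']:>12d}")
    best = ranked[0][0]
    lines.append(f"recommended split: {best}")
    return "\n".join(lines), best


# Made-up aggregate results for illustration only.
summary = {
    "50/50": {"low_pct": 40.0, "low_count": 8},
    "70/30": {"low_pct": 55.0, "low_count": 11},
    "80/20": {"low_pct": 30.0, "low_count": 6},
}
report, best = format_report(summary)
print(report)
```

When two split sizes score similarly, it may be worth reporting both rather than a single winner.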
Code Example
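The Python code below evaluates train/test split ratios by counting, for each ratio, the trials in which the divergence between cross-validation (CV) scores computed on the training set and CV scores computed on the test set is at or below a threshold. Note that, unlike the procedure above, it obtains the test-side score distribution by running k-fold CV on the test set itself rather than by scoring a single model trained on the full training set.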
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import entropy


def calculate_divergence(cv_scores_train, cv_scores_test, divergence_type="kl"):
    """
    Calculate divergence between two score distributions.

    Parameters:
        cv_scores_train (array-like): Cross-validation scores on the training set.
        cv_scores_test (array-like): Cross-validation scores on the test set.
        divergence_type (str): Type of divergence to calculate ("kl" or "js").

    Returns:
        float: Divergence value.
    """
    # Normalize distributions
    train_dist = np.histogram(cv_scores_train, bins=20, range=(0, 1), density=True)[0]
    test_dist = np.histogram(cv_scores_test, bins=20, range=(0, 1), density=True)[0]

    # Add small values to avoid zero-division
    train_dist += 1e-10
    test_dist += 1e-10

    if divergence_type == "kl":
        return entropy(train_dist, test_dist)
    elif divergence_type == "js":
        avg_dist = 0.5 * (train_dist + test_dist)
        return 0.5 * (entropy(train_dist, avg_dist) + entropy(test_dist, avg_dist))
    else:
        raise ValueError("Invalid divergence_type. Choose 'kl' or 'js'.")


def evaluate_split_ratios_with_low_divergence(
    X, y, split_ratios, model, metric,
    k=10, num_trials=30, threshold=0.1, divergence_type="js",
):
    """
    Evaluate train/test split ratios by counting cases of low divergence between CV scores.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of train/test split ratios to evaluate.
        model: A scikit-learn compatible model instance.
        metric: A scikit-learn compatible scoring function (e.g., accuracy_score).
        k (int): Number of folds for cross-validation.
        num_trials (int): Number of random trials per split ratio.
        threshold (float): Threshold for divergence to be considered "low."
        divergence_type (str): Type of divergence to calculate ("kl" or "js").

    Returns:
        dict: A dictionary with results for each split ratio, including the
        percentage of low-divergence trials.
    """
    scorer = make_scorer(metric)
    results = {}

    for train_ratio, test_ratio in split_ratios:
        split_key = f"Train:{train_ratio}, Test:{test_ratio}"
        low_divergence_count = 0

        for _ in range(num_trials):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=train_ratio, test_size=test_ratio, random_state=None
            )

            # Cross-validation scores
            cv_scores_train = cross_val_score(model, X_train, y_train, cv=k, scoring=scorer)
            cv_scores_test = cross_val_score(model, X_test, y_test, cv=k, scoring=scorer)

            # Calculate divergence
            divergence = calculate_divergence(cv_scores_train, cv_scores_test, divergence_type)

            # Count low divergence cases
            if divergence <= threshold:
                low_divergence_count += 1

        # Aggregate results
        results[split_key] = {
            "Low Divergence Cases (%)": (low_divergence_count / num_trials) * 100,
        }

    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Define train/test split ratios
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Define model and metric
    model = LogisticRegression()
    metric = accuracy_score

    # Run the diagnostic test
    results = evaluate_split_ratios_with_low_divergence(
        X, y, split_ratios, model, metric,
        k=5, num_trials=30, threshold=0.1, divergence_type="js",
    )

    # Print results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        for key, value in details.items():
            print(f" - {key}: {value}")
        print("-" * 40)
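Example Output
The splits are drawn without a fixed random seed (random_state=None), so the percentages differ from run to run.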
Split Ratio Train:0.5, Test:0.5:
- Low Divergence Cases (%): 33.33333333333333
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- Low Divergence Cases (%): 26.666666666666668
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- Low Divergence Cases (%): 13.333333333333334
----------------------------------------
Split Ratio Train:0.9, Test:0.1:
- Low Divergence Cases (%): 3.3333333333333335
----------------------------------------