Split Sensitivity Analysis By Performance Correlation
Procedure
- Define Split Sizes
  - Determine the train/test split sizes to evaluate, such as 50/50, 70/30, 80/20, and 90/10.
- Choose a Metric and Model
  - Select a performance metric (e.g., accuracy, F1 score, or mean squared error) for evaluation.
  - Choose a standard, well-performing model (e.g., random forest) to ensure consistent and reliable baseline performance.
- Perform k-Fold Cross-Validation
  - For each split size, split the data and conduct k-fold cross-validation on both the train and test portions, calculating the performance metric for each fold.
  - Record the performance metric scores for each fold and split size.
- Repeat Trials for Each Split Size
  - Repeat the split and k-fold cross-validation process multiple times (e.g., 10–20 iterations) for each split size to account for random variability.
  - Gather the sample of performance metric scores for each trial and split size.
- Assess Correlation of Performance Scores
  - For each trial, compute the correlation (e.g., Pearson's or Spearman's) between the cross-validation scores obtained on the train set and those obtained on the test set (a minimal sketch follows this procedure).
  - Record the percentage of trials where the train and test scores are strongly correlated (e.g., correlation coefficient > 0.8).
- Aggregate and Analyze Results
  - Summarize the correlation results for each split size, including:
    - Percentage of trials with strong correlations.
    - Variability in correlations across split sizes.
  - Identify the split size(s) that consistently yield strong correlations and stable performance metrics.
- Report Findings
  - Prepare a report that includes:
    - Comparison of correlation percentages across split sizes.
    - Insights into which split sizes provide the most stable and reliable performance metrics.
  - Recommend the optimal split size based on evidence from the analysis.
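As a minimal sketch of steps 5 and 6, the snippet below summarizes two trials' worth of hypothetical fold scores for a single split size using Spearman's correlation. The summarize_split helper, the 0.8 threshold, and the score values are illustrative assumptions, not part of the full diagnostic implemented in the next section.

import numpy as np
from scipy.stats import spearmanr

def summarize_split(trial_score_pairs, threshold=0.8):
    # Hypothetical helper: each element pairs the train-set and test-set fold scores from one trial.
    correlations = [spearmanr(train_scores, test_scores)[0]
                    for train_scores, test_scores in trial_score_pairs]
    strong = [c for c in correlations if c > threshold]
    return {
        "pct_strong": 100.0 * len(strong) / len(correlations),  # step 5: share of strongly correlated trials
        "mean_corr": float(np.mean(correlations)),               # step 6: central tendency across trials
        "std_corr": float(np.std(correlations)),                 # step 6: variability across trials
    }

# Illustrative fold scores from two trials of one split size (assumed values).
trials = [
    (np.array([0.81, 0.79, 0.84, 0.80, 0.78]), np.array([0.80, 0.77, 0.85, 0.79, 0.76])),
    (np.array([0.82, 0.75, 0.88, 0.79, 0.81]), np.array([0.70, 0.83, 0.72, 0.86, 0.74])),
]
print(summarize_split(trials))

Computed per split size, this kind of summary is what step 6 compares when identifying the most stable ratio.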
Code Example
Below is a Python implementation of this diagnostic: for each candidate train/test ratio it repeatedly splits the data, runs k-fold cross-validation on both the train and the test portions, and measures how strongly the two sets of fold scores correlate across trials.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from scipy.stats import spearmanr


def split_sensitivity_analysis(X, y, split_ratios, model, metric, num_trials=30, correlation_threshold=0.75, k=10):
    """
    Perform train/test split sensitivity analysis by evaluating the correlation of performance scores.

    Parameters:
        X (np.ndarray): Feature dataset (numerical variables only).
        y (np.ndarray): Target array.
        split_ratios (list of tuples): (train_size, test_size) pairs to evaluate (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        model (sklearn estimator): Machine learning model to evaluate.
        metric (str): Scoring metric for evaluation.
        num_trials (int): Number of random splits to test for each ratio.
        correlation_threshold (float): Threshold above which a trial counts as "highly correlated".
        k (int): Number of folds in k-fold cross-validation.

    Returns:
        dict: Dictionary keyed by split ratio, holding the percentage of highly correlated trials,
              the mean correlation, and the number of trials.
    """
    results = {}

    for split_ratio in split_ratios:
        split_key = f"Train:{split_ratio[0]}, Test:{split_ratio[1]}"
        high_correlation_count = 0
        correlations = []

        for _ in range(num_trials):
            # Split the data into train and test portions for this trial
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=split_ratio[0], test_size=split_ratio[1],
                random_state=None, stratify=y
            )

            # Perform k-fold cross-validation on the train set
            train_scores = cross_val_score(model, X_train, y_train, scoring=metric, cv=k)

            # Perform k-fold cross-validation on the test set
            test_scores = cross_val_score(model, X_test, y_test, scoring=metric, cv=k)

            # Compute the rank correlation between train and test fold scores
            corr, _ = spearmanr(train_scores, test_scores)
            correlations.append(corr)
            if corr >= correlation_threshold:
                high_correlation_count += 1

        # Summarize the trials for this split ratio
        results[split_key] = {
            "high_correlation_percentage": (high_correlation_count / num_trials) * 100,
            "mean_correlation": np.mean(correlations),
            "num_trials": num_trials,
        }

    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=100, n_features=10, n_informative=8, random_state=42
    )

    # Define split ratios to evaluate
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]

    # Define model and metric
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    metric = 'accuracy'

    # Run the sensitivity analysis
    results = split_sensitivity_analysis(
        X, y, split_ratios, model, metric, correlation_threshold=0.5
    )

    # Print Results
    for split, details in results.items():
        print(f"Split Ratio {split}:")
        print(f" - High Correlation Percentage: {details['high_correlation_percentage']:.2f}%")
        print(f" - Mean Correlation: {details['mean_correlation']}")
        print("-" * 40)
Example Output
Split Ratio Train:0.5, Test:0.5:
- High Correlation Percentage: 10.00%
- Mean Correlation: -0.028202575708872864
----------------------------------------
Split Ratio Train:0.7, Test:0.3:
- High Correlation Percentage: 6.67%
- Mean Correlation: 0.020054506714414368
----------------------------------------
Split Ratio Train:0.8, Test:0.2:
- High Correlation Percentage: 0.00%
- Mean Correlation: -0.07473857510562325
----------------------------------------
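To close the loop on step 7 of the procedure, the results dictionary returned above can be condensed into a short comparison and recommendation. The report_findings helper below is a minimal sketch intended to be appended to the demo above (it reuses its results variable); the ranking rule, sorting by the percentage of strongly correlated trials and breaking ties by mean correlation, is an assumption rather than part of the original diagnostic.

def report_findings(results):
    # Rank split ratios by the share of strongly correlated trials,
    # breaking ties by mean correlation (assumed ranking rule).
    ranked = sorted(
        results.items(),
        key=lambda item: (item[1]["high_correlation_percentage"], item[1]["mean_correlation"]),
        reverse=True,
    )
    print(f"{'Split ratio':<25}{'% strong corr.':>16}{'Mean corr.':>12}")
    for split, details in ranked:
        print(f"{split:<25}{details['high_correlation_percentage']:>15.2f}%{details['mean_correlation']:>12.3f}")
    best_split = ranked[0][0]
    print(f"Recommended split based on this evidence: {best_split}")

report_findings(results)

On the example output above, this ranking would place the 50/50 split first, although with only 10% of trials strongly correlated the evidence for any ratio in this small synthetic dataset is weak.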