Performance Central Tendencies
Procedure
Assumes a chosen model and performance metric, and that the same k-fold cross-validation procedure is applied to both the train and test sets so the performance comparison is apples-to-apples.
Prepare Performance Score Samples
- What to do: Apply the same k-fold cross-validation evaluation procedure to both the train and test datasets.
- Collect a sample of performance scores from the train set.
- Collect a sample of performance scores from the test set.
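A minimal sketch of this step, assuming a scikit-learn estimator and a synthetic classification dataset as stand-ins for your own chosen model and data:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data and model; substitute your own chosen model and datasets
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)

# The same k-fold CV procedure applied to each dataset yields two samples of scores
train_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
test_scores = cross_val_score(model, X_test, y_test, cv=10, scoring="accuracy")
print("train scores:", np.round(train_scores, 3))
print("test scores:", np.round(test_scores, 3))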
Choose a Statistical Test
- What to do: Decide whether to use a parametric or non-parametric test based on the distribution of the performance scores:
- Use a t-Test if the scores are approximately normally distributed and have similar variances.
- Use a Mann-Whitney U Test if the scores are not normally distributed or have unequal variances.
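A sketch of this decision using SciPy's Shapiro-Wilk test for normality and Levene's test for equal variances; the score arrays below are illustrative stand-ins for the CV score samples collected in the previous step:
import numpy as np
from scipy.stats import shapiro, levene

# Stand-in score samples; in practice use the CV scores collected above
train_scores = np.array([0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.95, 0.92])
test_scores = np.array([0.88, 0.92, 0.90, 0.93, 0.87, 0.91, 0.92, 0.89, 0.94, 0.90])

alpha = 0.05
_, p_train_normal = shapiro(train_scores)           # H0: sample is normally distributed
_, p_test_normal = shapiro(test_scores)
_, p_equal_var = levene(train_scores, test_scores)  # H0: samples have equal variances

if p_train_normal > alpha and p_test_normal > alpha and p_equal_var > alpha:
    print("Use a t-Test (both samples look normal with similar variances)")
else:
    print("Use a Mann-Whitney U Test")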
Perform the Statistical Test
- What to do: Apply the chosen statistical test to compare the central tendencies of the performance score samples:
- For a t-Test, calculate the p-value to test for equality of means.
- For a Mann-Whitney U Test, calculate the p-value to test for equality of distributions.
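A sketch of both tests with SciPy, again assuming `train_scores` and `test_scores` from the earlier steps (stand-in arrays shown so the snippet runs on its own):
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Stand-in score samples; in practice use the CV scores collected earlier
train_scores = np.array([0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.95, 0.92])
test_scores = np.array([0.88, 0.92, 0.90, 0.93, 0.87, 0.91, 0.92, 0.89, 0.94, 0.90])

# Parametric route: independent-samples t-Test comparing the means
t_stat, t_p = ttest_ind(train_scores, test_scores)

# Non-parametric route: two-sided Mann-Whitney U Test comparing the distributions
u_stat, u_p = mannwhitneyu(train_scores, test_scores, alternative="two-sided")

print(f"t-Test: statistic={t_stat:.3f}, p-value={t_p:.3f}")
print(f"Mann-Whitney U: statistic={u_stat:.3f}, p-value={u_p:.3f}")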
Interpret the Results
- What to do: Assess whether there is statistical evidence that the performance scores on the train and test sets differ:
- If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis and conclude that the train and test performance scores are significantly different.
- If the p-value is greater than the significance level, fail to reject the null hypothesis, indicating no significant difference between the train and test performance scores.
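The decision rule itself is short; a sketch assuming `p_value` comes from whichever test was run above (the value shown here is illustrative):
alpha = 0.05     # chosen significance level
p_value = 0.13   # e.g., the p-value returned by the test in the previous step

if p_value < alpha:
    print("Reject H0: train and test performance scores differ significantly")
else:
    print("Fail to reject H0: no significant difference between train and test scores")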
Report the Findings
- What to do: Summarize the results of the statistical test for stakeholders:
- Clearly state whether the performance scores on the train and test sets are statistically similar or different.
- Include relevant metrics (e.g., p-value, mean/median scores) and implications for model generalizability.
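A minimal reporting sketch, assuming the score samples and test results from the previous steps (stand-in values shown):
import numpy as np

# Stand-in values; in practice take these from the previous steps
train_scores = np.array([0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.95, 0.92])
test_scores = np.array([0.88, 0.92, 0.90, 0.93, 0.87, 0.91, 0.92, 0.89, 0.94, 0.90])
test_type, p_value, alpha = "t-Test", 0.13, 0.05

verdict = "statistically similar" if p_value >= alpha else "significantly different"
print(f"{test_type}: p-value={p_value:.3f} (alpha={alpha})")
print(f"Mean train score: {train_scores.mean():.3f}, mean test score: {test_scores.mean():.3f}")
print(f"Conclusion: train and test performance scores are {verdict}.")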
Code Example
Below is a Python function that compares the performance of a model on train and test datasets using k-fold cross-validation and applies a statistical test to evaluate whether the performances differ.
import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from sklearn.metrics import make_scorer
def performance_comparison_test(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
"""
Perform a diagnostic test to compare train and test set performances using k-fold CV.
Parameters:
X_train, y_train: Training set features and labels.
X_test, y_test: Test set features and labels.
model: Machine learning model to evaluate.
metric: Scoring function compatible with scikit-learn.
k (int): Number of folds for cross-validation.
alpha (float): P-value threshold for statistical significance (default=0.05).
Returns:
dict: Contains statistical test results, p-value, and interpretation.
"""
# Cross-validation scores for the train set
train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
# Cross-validation scores for the test set
test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))
# Check normality of distributions
_, p_train_normal = shapiro(train_scores)
_, p_test_normal = shapiro(test_scores)
# Determine the appropriate statistical test
if p_train_normal > alpha and p_test_normal > alpha:
# Perform t-test if variances are equal
stat, p_value = ttest_ind(train_scores, test_scores)
test_type = "t-Test"
else:
# Perform Mann-Whitney U Test if variances differ
stat, p_value = mannwhitneyu(train_scores, test_scores, alternative='two-sided')
test_type = "Mann-Whitney U Test"
# Interpret results
significance = "significant" if p_value <= alpha else "not significant"
interpretation = (
"Train and test performances are similar"
if significance == "not significant"
else "Train and test performances differ"
)
return {
"Statistical Test": test_type,
"Statistic": stat,
"P-Value": p_value,
"Significance": significance,
"Interpretation": interpretation
}
# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Create synthetic data
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the performance comparison test
    results = performance_comparison_test(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        k=10
    )

    # Print results
    print("Performance Comparison Test Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")
Example Output
Performance Comparison Test Results:
- Statistical Test: t-Test
- Statistic: 1.5852581740085454
- P-Value: 0.13031879056799284
- Significance: not significant
- Interpretation: Train and test performances are similar