Performance Variance
Procedure
Assumes a chosen model and performance metric, and that the same k-fold cross-validation procedure is applied to both the train and test sets for an apples-to-apples performance comparison.
Evaluate the Chosen Model on Train and Test Sets
- What to do: Use the chosen evaluation procedure (e.g., k-fold cross-validation) to gather performance scores, as shown in the sketch below:
- Apply k-fold cross-validation to the train set and collect a sample of performance scores.
- Apply the same k-fold cross-validation procedure to the test set and collect a separate sample of performance scores.
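As a sketch of this step, the snippet below collects the two score samples with scikit-learn's cross_val_score; the synthetic dataset, 80/20 split, random forest model, and accuracy metric are illustrative placeholders rather than prescribed choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Illustrative data, split, and model (stand-ins for the chosen setup)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
model = RandomForestClassifier(random_state=42)

# The same k-fold procedure applied independently to each set
cv = KFold(n_splits=10, shuffle=True, random_state=42)
train_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
test_scores = cross_val_score(model, X_test, y_test, cv=cv, scoring="accuracy")
print("Train score variance:", train_scores.var(ddof=1))
print("Test score variance:", test_scores.var(ddof=1))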
Choose the Variance Testing Method
- What to do: Select a statistical test to compare the variances of the two samples (see the sketch below):
- Use an F-test if the performance score samples are normally distributed.
- Use Levene’s test or the Brown-Forsythe test if the samples may not follow normal distributions or have unequal sizes.
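A minimal sketch of that selection logic, using synthetic stand-in score samples: in SciPy, levene with center='mean' is the classical Levene test, while center='median' gives the Brown-Forsythe variant; the F-test branch itself is shown in the later steps.
import numpy as np
from scipy.stats import shapiro, levene

# Synthetic stand-ins for the train/test CV score samples
rng = np.random.default_rng(1)
train_scores = rng.normal(loc=0.92, scale=0.02, size=10)
test_scores = rng.normal(loc=0.90, scale=0.03, size=10)

alpha = 0.05
# Normal-looking samples support the F-test; otherwise fall back to a robust test
if shapiro(train_scores).pvalue > alpha and shapiro(test_scores).pvalue > alpha:
    print("Both samples look normal: use the F-test for equality of variances")
else:
    # center='mean' gives the classical Levene test; center='median' is Brown-Forsythe
    stat, p = levene(train_scores, test_scores, center="median")
    print(f"Brown-Forsythe test: stat={stat:.3f}, p={p:.3f}")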
Verify Assumptions for the Test
- What to do: Ensure the chosen test’s assumptions are met (see the sketch below):
- For the F-test, assess normality of the performance score samples using a normality test (e.g., Shapiro-Wilk or Kolmogorov-Smirnov).
- For Levene’s or Brown-Forsythe tests, confirm the independence of the samples and review sample size requirements.
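A short sketch of the normality check, again on synthetic stand-in scores; the Shapiro-Wilk test is used here, and p-values above the significance level are taken as consistent with normality.
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-ins for the two CV score samples
rng = np.random.default_rng(42)
train_scores = rng.normal(loc=0.92, scale=0.02, size=10)
test_scores = rng.normal(loc=0.90, scale=0.03, size=10)

alpha = 0.05
for name, scores in [("train", train_scores), ("test", test_scores)]:
    stat, p = shapiro(scores)
    verdict = "looks normal" if p > alpha else "non-normal"
    print(f"{name}: W={stat:.3f}, p={p:.3f} -> {verdict}")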
Perform the Variance Test
- What to do: Compare the variances of the train and test performance score samples (see the sketch below):
- Input the two samples into the selected test.
- Record the test statistic and p-value.
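For the F-test branch, a minimal sketch (with synthetic stand-in scores) computes the ratio of sample variances and refers it to an F distribution for a two-sided p-value; Levene’s or the Brown-Forsythe test would replace this block when normality is doubtful.
import numpy as np
from scipy.stats import f as f_dist

# Synthetic stand-ins for the two CV score samples
rng = np.random.default_rng(0)
train_scores = rng.normal(loc=0.92, scale=0.02, size=10)
test_scores = rng.normal(loc=0.90, scale=0.03, size=10)

# F statistic: ratio of the two sample variances
stat = np.var(train_scores, ddof=1) / np.var(test_scores, ddof=1)
dfn, dfd = len(train_scores) - 1, len(test_scores) - 1
# Two-sided p-value from the F distribution
p_value = 2 * min(f_dist.cdf(stat, dfn, dfd), f_dist.sf(stat, dfn, dfd))
print(f"F={stat:.3f}, p={p_value:.3f}")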
Interpret the Results
- What to do: Analyze the output of the variance test (see the sketch below):
- If the p-value is less than the significance level (commonly 0.05), conclude that the variances are significantly different.
- If the p-value is greater than the significance level, conclude that there is no significant difference in variances.
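The decision rule reduces to a comparison against the significance level; in the sketch below, p_value is a placeholder standing in for the result of whichever test was run.
alpha = 0.05
p_value = 0.217  # placeholder carried over from the chosen variance test

if p_value < alpha:
    print("Variances differ significantly between train and test scores")
else:
    print("No significant difference in variances")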
Report the Results
- What to do: Summarize the findings:
- Clearly state the test used, the test statistic, and the p-value.
- Explain the practical implications for the model’s performance stability across the train and test sets (e.g., variance inconsistency might suggest overfitting or instability).
Code Example
This function tests whether the variances of performance scores on the train and test sets are statistically different. It automatically selects the appropriate test based on the normality of the score distributions, using a chosen model, scoring metric, and k-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from scipy.stats import shapiro, levene, f as f_dist


def variance_diagnostic(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Perform a diagnostic test to compare variances of train and test CV scores.

    Parameters:
        X_train, y_train: Train set X and y elements.
        X_test, y_test: Test set X and y elements.
        model: The machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        alpha (float): P-value cut-off for statistical tests (default=0.05).

    Returns:
        dict: Dictionary containing the test name, statistic, p-value, significance, and interpretation.
    """
    # Cross-validation scores for the train set
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    # Cross-validation scores for the test set
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Check normality of the score distributions
    _, p_train_normal = shapiro(cv_train_scores)
    _, p_test_normal = shapiro(cv_test_scores)

    if p_train_normal > alpha and p_test_normal > alpha:
        # Both distributions look normal: use the F-test for equality of variances,
        # i.e., the ratio of sample variances referred to an F distribution
        var_train = np.var(cv_train_scores, ddof=1)
        var_test = np.var(cv_test_scores, ddof=1)
        stat = var_train / var_test
        dfn, dfd = len(cv_train_scores) - 1, len(cv_test_scores) - 1
        # Two-sided p-value for the variance-ratio F-test
        p_value = 2 * min(f_dist.cdf(stat, dfn, dfd), f_dist.sf(stat, dfn, dfd))
        test_name = "F-Test"
    else:
        # Use Levene's test if normality is not met
        stat, p_value = levene(cv_train_scores, cv_test_scores)
        test_name = "Levene's Test"

    # Interpret results
    significance = "significant" if p_value <= alpha else "not significant"
    interpretation = "Variances are different" if p_value <= alpha else "No evidence of variance difference"

    return {
        "Test Name": test_name,
        "Statistic": stat,
        "P-Value": p_value,
        "Significance": significance,
        "Interpretation": interpretation,
    }
# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the variance diagnostic test
    results = variance_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        10
    )

    # Print Results
    print("Variance Diagnostic Test Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")

Example Output
Variance Diagnostic Test Results:
- Test Name: F-Test
- Statistic: 1.638202247191011
- P-Value: 0.2168230881706279
- Significance: not significant
- Interpretation: No evidence of variance difference