Performance Distributions
Procedure
This procedure assumes a chosen model and performance metric, and that k-fold cross-validation is applied to both the train and test sets so the comparison is apples-to-apples.
Prepare Samples of Performance Scores
- What to do: Collect performance scores using the same evaluation procedure for both train and test sets.
- Apply k-fold cross-validation to the train set to obtain a sample of performance scores.
- Apply k-fold cross-validation to the test set to obtain a comparable sample of performance scores, as shown in the sketch below.
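A minimal sketch of this step, assuming scikit-learn's cross_val_score and make_scorer; model, metric, k, and the train/test splits are placeholders matching the full example at the end of this section.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

# Same evaluation procedure on both sets: k-fold cross-validation with one scorer
train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))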
Choose a Statistical Test
- What to do: Decide on the appropriate statistical test for comparing distributions of the train and test performance scores.
- Use the Kolmogorov-Smirnov Test to assess whether the two distributions differ significantly.
- Alternatively, use the Anderson-Darling Test for a more sensitive comparison when the samples are small; both options are sketched below.
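A minimal sketch of both options, assuming SciPy's ks_2samp and anderson_ksamp and the train_scores/test_scores arrays from the previous sketch.

from scipy.stats import ks_2samp, anderson_ksamp

# Kolmogorov-Smirnov test: compares the two empirical score distributions
ks_stat, ks_p_value = ks_2samp(train_scores, test_scores)

# k-sample Anderson-Darling test: weights the tails more heavily, which can
# help with small samples; SciPy reports an approximate, capped p-value
# via the significance_level attribute
ad_result = anderson_ksamp([train_scores, test_scores])
print(ad_result.statistic, ad_result.significance_level)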
Perform the Statistical Test
- What to do: Execute the chosen test to compare the distributions of performance scores.
- Input the train and test performance score samples into the selected statistical test.
- Record the test statistic and p-value from the result, as in the sketch below.
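Continuing with the same hypothetical names, a sketch of executing the test and recording its outputs.

from scipy.stats import ks_2samp

# Feed both score samples to the test and keep its outputs for interpretation
ks_stat, ks_p_value = ks_2samp(train_scores, test_scores)
test_result = {"statistic": ks_stat, "p_value": ks_p_value}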
Interpret the Results
- What to do: Assess whether the train and test score distributions are statistically similar or different.
- If the p-value is below the chosen significance level (e.g., 0.05), conclude that the distributions are different.
- If the p-value is above the significance level, conclude that there is no significant difference between the distributions (see the sketch below).
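A minimal interpretation sketch, assuming the ks_p_value from the previous sketch and a significance level of 0.05.

alpha = 0.05  # chosen significance level

# A p-value at or below alpha suggests the two score distributions differ
if ks_p_value <= alpha:
    interpretation = "significant difference between train and test score distributions"
else:
    interpretation = "no significant difference between train and test score distributions"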
Report the Findings
- What to do: Summarize the results of the statistical test for stakeholders.
- Clearly state whether the train and test distributions are statistically similar or different.
- Discuss potential implications for model generalization or performance issues, if applicable; a brief reporting sketch follows below.
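A short, hypothetical reporting sketch that turns the recorded outputs into a stakeholder-friendly summary; all names are carried over from the sketches above.

# Plain-language summary of the test outcome
report = (
    f"Kolmogorov-Smirnov test on {k}-fold CV scores: "
    f"statistic={ks_stat:.3f}, p-value={ks_p_value:.3f} ({interpretation}). "
    "A significant difference can indicate a generalization gap between train and test."
)
print(report)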
Code Example
Below is a Python function that compares the distributions of cross-validation performance scores on the train and test sets, using the Kolmogorov-Smirnov test to assess their similarity.
import numpy as np
from sklearn.model_selection import cross_val_score
from scipy.stats import ks_2samp
from sklearn.metrics import make_scorer


def distribution_comparison_diagnostic(X_train, y_train, X_test, y_test, model, metric, k, alpha=0.05):
    """
    Compare the distributions of cross-validation performance scores on the train and test sets.

    Parameters:
        X_train, y_train: Train set features and target.
        X_test, y_test: Test set features and target.
        model: Machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        alpha (float): Significance level for statistical tests (default=0.05).

    Returns:
        dict: Contains the test statistic, p-value, and interpretation of the test.
    """
    # Cross-validation scores for train and test sets
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

    # Perform Kolmogorov-Smirnov Test
    ks_stat, ks_p_value = ks_2samp(cv_train_scores, cv_test_scores)

    # Interpret results
    ks_significant = "significant difference" if ks_p_value <= alpha else "no significant difference"

    return {
        "Kolmogorov-Smirnov Test": {
            "Statistic": ks_stat,
            "P-Value": ks_p_value,
            "Significance": ks_significant
        }
    }


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the distribution comparison diagnostic test
    results = distribution_comparison_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        k=10
    )

    # Print Results
    print("Distribution Comparison Diagnostic Test Results:")
    for test_name, result in results.items():
        print(f"\n{test_name}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Distribution Comparison Diagnostic Test Results:
Kolmogorov-Smirnov Test:
- Statistic: 0.3
- P-Value: 0.7869297884777761
- Significance: no significant difference