Performance Effect Size
Procedure
This procedure assumes a chosen model and performance metric, and uses k-fold cross-validation on both sets so that the performance comparison is apples-to-apples.
Gather Performance Scores for Train and Test Sets
- What to do: Run k-fold cross-validation separately on the train set and on the test set, using the same model, performance metric, and k for both.
- Generate a sample of performance scores for the train set.
- Generate a sample of performance scores for the test set.
Compute Effect Size
- What to do: Quantify the magnitude of the difference between train and test performance scores using Cohen’s d.
- Calculate Cohen’s d using the formula: $ d = \frac{\text{mean(train scores)} - \text{mean(test scores)}}{\text{pooled standard deviation}} $.
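- Compute the pooled standard deviation from the sample variances (n - 1 denominator) of the two sets of fold scores; when both sets have the same number of folds, it is the square root of the average of the two variances.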
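The pooled standard deviation used in the denominator can be written out explicitly. The symbols $ n_1, n_2 $ (number of fold scores in each sample) and $ s_1^2, s_2^2 $ (their sample variances) are notation introduced here for illustration; this is the standard pooled form, shown as a sketch rather than a prescribed method:

$ s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}} $

When both samples come from the same k (so $ n_1 = n_2 = k $), this reduces to $ s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}} $.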
Interpret the Effect Size
- What to do: Use Cohen’s d thresholds to determine the practical significance of the difference:
- Small effect: $ d \approx 0.2 $.
- Medium effect: $ d \approx 0.5 $.
- Large effect: $ d \approx 0.8 $.
- Assess whether the difference is meaningful based on the magnitude $ |d| $ (a positive $ d $ means the train scores are higher than the test scores):
- If the effect size is small or below, the performance gap is likely negligible, suggesting the model generalizes well.
- If the effect size is medium or large, further investigation into potential overfitting or dataset issues may be warranted.
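As a rough illustration of the calculation and thresholds above, the following sketch computes Cohen's d for two small samples of fold scores using only the Python standard library. The score values are made up, the helper names `cohens_d` and `label_effect` are hypothetical, and treating the 0.2 / 0.5 / 0.8 benchmarks as bin boundaries is an assumption for illustration.

```python
import math
import statistics


def cohens_d(train_scores, test_scores):
    """Cohen's d using the pooled standard deviation of two samples."""
    n1, n2 = len(train_scores), len(test_scores)
    var1 = statistics.variance(train_scores)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(test_scores)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(train_scores) - statistics.mean(test_scores)) / pooled_sd


def label_effect(d):
    """Map |d| to a label, using the benchmarks as bin boundaries (an assumption)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical fold scores, made up for illustration only
train_scores = [0.92, 0.94, 0.93, 0.95, 0.91]
test_scores = [0.90, 0.89, 0.92, 0.91, 0.88]

d = cohens_d(train_scores, test_scores)
print(f"Cohen's d = {d:.2f} ({label_effect(d)})")  # Cohen's d = 1.90 (large)
```

Here the mean gap is only 0.03, yet the effect size is large because the fold scores vary so little. The label reflects the gap relative to fold-to-fold spread, not its absolute size.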
Report the Findings
- What to do: Summarize the calculated effect size and provide an interpretation of its significance.
- Clearly state whether the difference between train and test performance scores is practically significant.
- If applicable, provide recommendations for further analysis or adjustments to the model or data preparation process.
Code Example
The function below computes the effect size (Cohen’s d) between train and test set cross-validation scores and returns it together with a label for its magnitude.
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def effect_size_diagnostic(X_train, y_train, X_test, y_test, model, metric, k):
    """
    Calculate and interpret the effect size (Cohen's d) between
    train and test cross-validation performance scores.

    Parameters:
        X_train, y_train: Train set X and y elements.
        X_test, y_test: Test set X and y elements.
        model: The machine learning model to evaluate.
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.

    Returns:
        dict: Dictionary containing Cohen's d and its interpretation.
    """
    scorer = make_scorer(metric)
    # Cross-validation scores for the train set
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=scorer)
    # Cross-validation scores for the test set
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=scorer)
    # Means and pooled standard deviation (equal sample sizes: both samples have k scores)
    mean_train = np.mean(cv_train_scores)
    mean_test = np.mean(cv_test_scores)
    pooled_std = np.sqrt((np.var(cv_train_scores, ddof=1) + np.var(cv_test_scores, ddof=1)) / 2)
    # Cohen's d is undefined when neither sample has any variance
    if pooled_std == 0:
        raise ValueError("Cohen's d is undefined: both score samples have zero variance.")
    # Calculate Cohen's d (positive when train scores exceed test scores)
    cohen_d = (mean_train - mean_test) / pooled_std
    # Interpret the magnitude of Cohen's d
    if abs(cohen_d) < 0.2:
        interpretation = "Negligible difference"
    elif abs(cohen_d) < 0.5:
        interpretation = "Small difference"
    elif abs(cohen_d) < 0.8:
        interpretation = "Medium difference"
    else:
        interpretation = "Large difference"
    return {
        "Cohen's d": cohen_d,
        "Interpretation": interpretation,
    }
# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Create synthetic data
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score
    # Run the effect size diagnostic test
    results = effect_size_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        10,
    )
    # Print Results
    print("Effect Size Diagnostic Test Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")
Example Output
Effect Size Diagnostic Test Results:
- Cohen's d: 1.0754961161637606
- Interpretation: Large difference