Performance Effect Size

Diagnostics

Procedure

This procedure assumes that a model and a performance metric have already been chosen, and that k-fold cross-validation is used so that train and test performance are compared on an apples-to-apples basis.

Gather Performance Scores for Train and Test Sets
- What to do: Run k-fold cross-validation separately on the train set and on the test set, using the same model, performance metric, and k.
  - Generate a sample of k performance scores (one per fold) for the train set.
  - Generate a sample of k performance scores (one per fold) for the test set.
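The step above can be sketched as follows. This is a minimal illustration rather than the only way to collect the scores: the synthetic dataset, the `LogisticRegression` model, the `"accuracy"` scorer, and `k = 5` are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Assumed synthetic data and split, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)  # assumed model
k = 5  # assumed number of folds

# One score per fold, computed separately within each set
train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring="accuracy")
test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring="accuracy")

print("Train fold scores:", train_scores)
print("Test fold scores:", test_scores)
```

Each call returns an array of k scores, which are the two samples compared in the next step.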
Compute Effect Size
- What to do: Quantify the magnitude of the difference between train and test performance scores using Cohen’s d.
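  - Calculate Cohen’s d using the formula: $d = \frac{\text{mean(train scores)} - \text{mean(test scores)}}{\text{pooled standard deviation}}$. A positive value means the train scores are higher than the test scores.
  - Calculate the pooled standard deviation from the sample variances of the two samples. When both samples contain the same number of scores (k each), it is $s_p = \sqrt{(s_{\text{train}}^2 + s_{\text{test}}^2) / 2}$.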
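As a small worked sketch of this calculation, the helpers below compute the pooled standard deviation and Cohen’s d with NumPy. The function names `pooled_std` and `cohens_d` are hypothetical, and the scores are made-up values for illustration. The general pooled formula weights each sample variance by its degrees of freedom; it reduces to the equal-size form when both samples have the same number of scores.

```python
import numpy as np

def pooled_std(a, b):
    """Pooled standard deviation of two samples (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    # Sample variances (ddof=1), weighted by their degrees of freedom
    var_a = a.var(ddof=1)
    var_b = b.var(ddof=1)
    return np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    return (np.mean(a) - np.mean(b)) / pooled_std(a, b)

# Made-up fold scores, for illustration only
train_scores = [0.90, 0.92, 0.94]
test_scores = [0.80, 0.82, 0.84]
print(cohens_d(train_scores, test_scores))
```

Here both samples have a standard deviation of 0.02 and the means differ by 0.10, so d comes out at approximately 5.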
Interpret the Effect Size
- What to do: Use Cohen’s conventional thresholds on the magnitude $|d|$ to judge the practical significance of the difference:
  - Small effect: $|d| \approx 0.2$.
  - Medium effect: $|d| \approx 0.5$.
  - Large effect: $|d| \approx 0.8$.
- Assess whether the difference is meaningful based on the size of the effect:
  - If the effect size is small or below, the performance gap is likely negligible, indicating the model generalizes well.
  - If the effect size is medium or large, further investigation into potential overfitting or dataset issues may be warranted.
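One way to turn these thresholds into a label is sketched below. The function name `interpret_cohens_d` is hypothetical, and treating 0.2, 0.5, and 0.8 as hard cut-offs (with anything under 0.2 labeled negligible) is an assumption; the thresholds themselves are only rough conventions.

```python
def interpret_cohens_d(d):
    """Map Cohen's d to a qualitative label using conventional cut-offs."""
    magnitude = abs(d)  # the sign only indicates which set scored higher
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

for value in (0.1, -0.3, 0.6, 1.2):
    print(value, "->", interpret_cohens_d(value))
```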
Report the Findings
- What to do: Summarize the calculated effect size and provide an interpretation of its significance.
  - Clearly state whether the difference between train and test performance scores is practically significant.
  - If applicable, provide recommendations for further analysis or adjustments to the model or data preparation process.
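A report along these lines could be assembled as in the sketch below. The function name `summarize_effect_size` and the `flag_at` parameter are hypothetical; flagging the gap as practically significant from a medium effect (0.5) upward is an assumption that mirrors the interpretation step, not a fixed rule.

```python
def summarize_effect_size(d, flag_at=0.5):
    """Build a short plain-text report for a train/test effect size."""
    flagged = abs(d) >= flag_at  # assumed cut-off: medium effect or larger
    lines = [
        f"Cohen's d = {d:.2f} (train minus test).",
        "Practically significant: " + ("yes" if flagged else "no") + ".",
    ]
    if flagged:
        lines.append(
            "Recommendation: investigate potential overfitting or dataset issues."
        )
    return "\n".join(lines)

print(summarize_effect_size(1.0754961161637606))
```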

Code Example

This function calculates the effect size (Cohen’s d) for the difference between train and test set cross-validation scores, interprets its magnitude, and returns both values in a dictionary.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

def effect_size_diagnostic(X_train, y_train, X_test, y_test, model, metric, k):
    """
    Perform a diagnostic test to calculate and interpret the effect size (Cohen's d) between
    train and test cross-validation performance scores.

    Parameters:
        X_train, y_train: Train set X and y elements.
        X_test, y_test: Test set X and y elements.
        model: The machine learning model to evaluate.
        metric: Performance metric function with the signature metric(y_true, y_pred), as expected by make_scorer.
        k (int): Number of folds for cross-validation.

    Returns:
        dict: Dictionary containing Cohen's d and its interpretation.
    """
    # Cross-validation scores for the train set
    cv_train_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    # Cross-validation scores for the test set
    cv_test_scores = cross_val_score(model, X_test, y_test, cv=k, scoring=make_scorer(metric))

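    # Calculate the means and the pooled standard deviation
    # (equal-size form: both samples contain k scores)
    mean_train = np.mean(cv_train_scores)
    mean_test = np.mean(cv_test_scores)
    pooled_std = np.sqrt((np.var(cv_train_scores, ddof=1) + np.var(cv_test_scores, ddof=1)) / 2)

    # Calculate Cohen's d; it is undefined when all fold scores are identical
    if pooled_std == 0:
        raise ValueError("Pooled standard deviation is zero; Cohen's d is undefined.")
    cohen_d = (mean_train - mean_test) / pooled_std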

    # Interpret Cohen's d
    if abs(cohen_d) < 0.2:
        interpretation = "Negligible difference"
    elif abs(cohen_d) < 0.5:
        interpretation = "Small difference"
    elif abs(cohen_d) < 0.8:
        interpretation = "Medium difference"
    else:
        interpretation = "Large difference"

    return {
        "Cohen's d": cohen_d,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Create synthetic data
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define model and metric
    chosen_model = RandomForestClassifier(random_state=42)
    chosen_metric = accuracy_score

    # Run the effect size diagnostic test
    results = effect_size_diagnostic(
        X_train, y_train,
        X_test, y_test,
        chosen_model,
        chosen_metric,
        10
    )

    # Print Results
    print("Effect Size Diagnostic Test Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Effect Size Diagnostic Test Results:
  - Cohen's d: 1.0754961161637606
  - Interpretation: Large difference
