Residual Error Effect Size

Procedure

This procedure evaluates whether the residual errors (model prediction errors) on the train set and test set follow the same distribution, helping to assess how consistently the model generalizes.

  1. Gather Residual Error Samples

    • What to do: Ensure you have prediction error samples for the train set and test set for your chosen machine learning algorithms.
      • Errors should be calculated as the difference between actual and predicted values for each data point.
      • Ensure the data has been split into train and test sets, and errors are collected for each set.
  2. Calculate Effect Size

    • What to do: Quantify the magnitude of the difference in residual distributions using an effect size measure.
      • Compute Cohen’s d to measure the standardized mean difference between train and test residuals:
        • $ d = \frac{M_1 - M_2}{SD_{pooled}} $, where $ M_1 $ and $ M_2 $ are the means of the train and test residuals, and $ SD_{pooled} $ is the pooled standard deviation.
      • Interpret the value:
        • Small effect size: |d| < 0.2
        • Medium effect size: 0.2 ≤ |d| < 0.5
        • Large effect size: |d| ≥ 0.5
      • Ensure both train and test residual samples are large enough for a reliable estimate of Cohen’s d (a minimal standalone computation is sketched just after this list).
  3. Interpret the Effect Size

    • What to do: Use the effect size to determine whether the train and test residuals are meaningfully different.
      • A small effect size indicates negligible differences, suggesting consistent model behavior across train and test sets.
      • A large effect size indicates a substantial difference, pointing to potential issues such as overfitting, data leakage, or a shift between the train and test distributions.
  4. Report the Findings

    • What to do: Summarize your findings based on the calculated effect size.
      • Clearly state the calculated Cohen’s d value and its interpretation (small, medium, or large).
      • Discuss any implications for the model’s generalization ability.
      • Provide actionable recommendations, such as revisiting feature engineering, regularization techniques, or data preprocessing if large effect sizes are observed.
      • Include visualizations, such as histograms or box plots comparing the residual distributions, to support the conclusions (a plotting sketch is given at the end of this section).
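
As a minimal standalone sketch of the effect size calculation (not part of the original procedure), the helper below computes Cohen’s d between two residual samples. It uses the degrees-of-freedom-weighted pooled standard deviation, which reduces to the formula above when the two samples are equal in size but also handles unequal sizes; the function name cohens_d is our own.

import numpy as np

def cohens_d(residuals_a, residuals_b):
    """Cohen's d between two residual samples, using the general pooled SD."""
    n_a, n_b = len(residuals_a), len(residuals_b)
    mean_a, mean_b = np.mean(residuals_a), np.mean(residuals_b)
    var_a = np.var(residuals_a, ddof=1)
    var_b = np.var(residuals_b, ddof=1)
    # Pooled SD weighted by degrees of freedom; handles unequal sample sizes
    pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

For step 1, the residual arrays can come from any fit/predict pair, e.g. residuals = y_test - model.predict(X_test); the full function in the code example below gathers them with cross-validation instead.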

Code Example

This Python function evaluates the effect size of the difference in residual errors between the train and test sets for a suite of machine learning algorithms, helping assess generalization consistency.

import numpy as np
from sklearn.model_selection import cross_val_predict

def residual_effect_size_test(X_train, y_train, X_test, y_test, algorithms, k=10):
    """
    Evaluate the effect size of the difference in residual errors between train and test sets.

    Parameters:
        X_train (np.ndarray): Train set features.
        y_train (np.ndarray): Train set target.
        X_test (np.ndarray): Test set features.
        y_test (np.ndarray): Test set target.
        algorithms (list): List of machine learning models to evaluate.
        k (int): Number of folds for cross-validation (default=10).

    Returns:
        dict: Dictionary with effect size and interpretation for each algorithm.
    """
    results = {}

    for alg in algorithms:
        # Out-of-fold predictions on the train set: each point is predicted
        # by a model fit on the remaining folds, so residuals are out-of-sample
        train_preds = cross_val_predict(alg, X_train, y_train, cv=k)
        train_residuals = y_train - train_preds

        # Out-of-fold predictions on the test set, gathered the same way
        test_preds = cross_val_predict(alg, X_test, y_test, cv=k)
        test_residuals = y_test - test_preds

        # Means and pooled standard deviation of the residuals.
        # This simple pooled SD averages the two variances, which assumes the
        # train and test residual samples are (roughly) equal in size.
        train_mean = np.mean(train_residuals)
        test_mean = np.mean(test_residuals)
        pooled_std = np.sqrt(
            (np.std(train_residuals, ddof=1) ** 2 + np.std(test_residuals, ddof=1) ** 2) / 2
        )

        # Calculate Cohen's d effect size
        cohen_d = (train_mean - test_mean) / pooled_std

        # Interpret Cohen's d using the thresholds from the procedure above
        if abs(cohen_d) < 0.2:
            interpretation = "Small effect size, negligible difference in residual errors."
        elif abs(cohen_d) < 0.5:
            interpretation = "Medium effect size, moderate difference in residual errors."
        else:
            interpretation = "Large effect size, substantial difference in residual errors."

        # Store results
        results[alg.__class__.__name__] = {
            "Cohen's d": cohen_d,
            "Interpretation": interpretation,
        }

    return results

# Demo Usage
if __name__ == "__main__":
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor

    # Generate synthetic data (test features are deliberately shifted: loc=0.2, scale=1.2)
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    y_train = X_train[:, 0] * 2 + np.random.normal(loc=0, scale=0.5, size=100)
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))
    y_test = X_test[:, 0] * 2 + np.random.normal(loc=0, scale=0.5, size=100)

    # Define models
    algorithms = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(n_estimators=10)]

    # Perform the diagnostic test
    results = residual_effect_size_test(X_train, y_train, X_test, y_test, algorithms, k=5)

    # Print results
    print("Residual Effect Size Test Results:")
    for model, result in results.items():
        print(f"{model}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Residual Effect Size Test Results:
LinearRegression:
  - Cohen's d: -0.00640635333539964
  - Interpretation: Small effect size, negligible difference in residual errors.
DecisionTreeRegressor:
  - Cohen's d: 0.08863288026208745
  - Interpretation: Small effect size, negligible difference in residual errors.
RandomForestRegressor:
  - Cohen's d: -0.0891434923059157
  - Interpretation: Small effect size, negligible difference in residual errors.
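
To support the reporting step, a minimal plotting sketch follows. It assumes matplotlib is installed and that train_residuals and test_residuals are 1-D NumPy arrays gathered as in the function above; the helper name plot_residual_distributions is our own.

import numpy as np
import matplotlib.pyplot as plt

def plot_residual_distributions(train_residuals, test_residuals, title="Residual distributions"):
    """Compare train and test residuals with overlaid histograms and box plots."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Overlaid histograms on a shared set of bins for a fair visual comparison
    bins = np.histogram_bin_edges(np.concatenate([train_residuals, test_residuals]), bins=30)
    ax_hist.hist(train_residuals, bins=bins, alpha=0.6, label="train")
    ax_hist.hist(test_residuals, bins=bins, alpha=0.6, label="test")
    ax_hist.set_xlabel("residual")
    ax_hist.set_ylabel("count")
    ax_hist.legend()

    # Side-by-side box plots of the same residual samples
    ax_box.boxplot([train_residuals, test_residuals], labels=["train", "test"])
    ax_box.set_ylabel("residual")

    fig.suptitle(title)
    plt.tight_layout()
    plt.show()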