Residual Error Central Tendencies

Procedure

This procedure evaluates whether the residual errors from model predictions on the train and test sets have the same central tendencies, which can provide insights into the consistency of the model’s performance across different datasets.

  1. Gather Residual Error Samples

    • What to do: Ensure you have residual error samples for both the train set and the test set for your chosen machine learning algorithms.
      • Residual errors are calculated as the difference between actual values and predicted values for each data point.
      • These should be collected for each algorithm in your chosen suite of machine learning models.
  2. Check Assumptions

    • What to do: Validate the assumptions of your chosen statistical test.
      • For t-Tests, check whether the residual error samples follow a normal distribution and have similar variances (see the assumption-check sketch after this list).
      • For Mann-Whitney U Tests, no specific distribution assumptions are required, but the samples should be independent.
  3. Perform the Statistical Test

    • What to do: Use the appropriate test to compare the central tendencies of the residual error distributions.
      • For normally distributed errors, use a t-Test (the Welch's variant does not require equal variances).
      • For non-normal or unknown distributions, use the Mann-Whitney U Test.
      • Apply the chosen test separately for each algorithm in the suite to compare its train and test error distributions; a test-selection sketch follows this list.
  4. Evaluate Statistical Significance

    • What to do: Analyze the p-value to determine whether the differences in central tendencies are statistically significant.
      • A p-value less than your chosen significance level (e.g., 0.05) indicates a statistically significant difference.
      • Record the p-value for each algorithm to assess its specific behavior.
  5. Interpret the Results

    • What to do: Draw conclusions based on the test outcomes.
      • If no significant differences are found, it suggests that the model generalizes well across train and test sets.
      • Significant differences indicate potential overfitting, underfitting, or data quality issues.
      • Consider which algorithms show consistent performance and investigate outliers for further insights.
  6. Report the Findings

    • What to do: Summarize the results and their implications for model performance.
      • Present the p-values for all algorithms, highlighting any significant differences.
      • Include any trends observed across multiple algorithms (e.g., consistent overfitting).
      • Use visualizations such as boxplots or density plots to show the distribution of residual errors for the train and test sets (see the plotting sketch at the end of this section).
      • Suggest next steps, such as hyperparameter tuning, data preprocessing, or selecting models with better generalization.
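
The assumption checks in step 2 can be scripted directly with SciPy. Below is a minimal sketch, assuming train_residuals and test_residuals are NumPy arrays of residual errors gathered as described in step 1; it pairs the Shapiro-Wilk test for normality with Levene's test for equal variances.

from scipy.stats import shapiro, levene

def check_test_assumptions(train_residuals, test_residuals, alpha=0.05):
    """Check normality (Shapiro-Wilk) and equal variances (Levene) of two residual samples."""
    _, p_train = shapiro(train_residuals)
    _, p_test = shapiro(test_residuals)
    _, p_var = levene(train_residuals, test_residuals)
    return {
        "train_normal": p_train > alpha,   # fail to reject normality for train residuals
        "test_normal": p_test > alpha,     # fail to reject normality for test residuals
        "equal_variance": p_var > alpha,   # fail to reject equal variances
    }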
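For step 3, the choice between the two tests follows from those checks. This sketch is a minimal illustration of the two SciPy calls, again assuming precomputed residual arrays; the full procedure, including the normality check and per-algorithm loop, appears in the Code Example below.

from scipy.stats import ttest_ind, mannwhitneyu

def compare_central_tendencies(train_residuals, test_residuals, assume_normal=True):
    """Compare the central tendencies of two residual samples."""
    if assume_normal:
        # Welch's t-test: compares means without assuming equal variances
        return ttest_ind(train_residuals, test_residuals, equal_var=False)
    # Mann-Whitney U test: nonparametric comparison of the two distributions
    return mannwhitneyu(train_residuals, test_residuals, alternative="two-sided")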

Code Example

This Python function uses statistical tests to evaluate whether the residual errors from model predictions on the train and test sets have the same central tendencies, and provides an interpretation of the results for each algorithm.

import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, shapiro
from sklearn.model_selection import cross_val_predict

def residual_distribution_test(
    X_train, y_train, X_test, y_test, algorithms, k=10, alpha=0.05
):
    """
    Test whether the residual errors on the train and test sets have the same central tendencies.

    Parameters:
        X_train (np.ndarray): Train set features.
        y_train (np.ndarray): Train set target.
        X_test (np.ndarray): Test set features.
        y_test (np.ndarray): Test set target.
        algorithms (list): List of machine learning models to evaluate.
        k (int): Number of folds for cross-validation (default=10).
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each algorithm.
    """
    results = {}

    for alg in algorithms:
        # Out-of-fold predictions via k-fold cross-validation on each split; residuals are actual minus predicted
        train_preds = cross_val_predict(alg, X_train, y_train, cv=k)
        train_residuals = y_train - train_preds
        test_preds = cross_val_predict(alg, X_test, y_test, cv=k)
        test_residuals = y_test - test_preds

        # Check normality of residuals
        _, p_train_normal = shapiro(train_residuals)
        _, p_test_normal = shapiro(test_residuals)

        # Select statistical test based on normality of both residual samples
        if p_train_normal > alpha and p_test_normal > alpha:
            test_name = "t-Test"
            # Welch's t-test does not assume equal variances
            stat, p_value = ttest_ind(train_residuals, test_residuals, equal_var=False)
        else:
            test_name = "Mann-Whitney U Test"
            stat, p_value = mannwhitneyu(train_residuals, test_residuals, alternative="two-sided")

        # Interpret results, wording the conclusion to match the test outcome
        significance = "significant" if p_value <= alpha else "not significant"
        if significance == "significant":
            conclusion = "statistically different between train and test sets, suggesting potential issues with generalization."
        else:
            conclusion = "not statistically different between train and test sets, suggesting consistent behavior across both sets."
        interpretation = f"The {test_name} indicates the residuals are {conclusion}"

        results[alg.__class__.__name__] = {
            "Statistical Test": test_name,
            "Test Statistic": stat,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation,
        }

    return results

# Demo Usage
if __name__ == "__main__":
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor

    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    y_train = X_train[:, 0] * 3 + np.random.normal(loc=0, scale=0.5, size=100)
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))
    y_test = X_test[:, 0] * 3 + np.random.normal(loc=0, scale=0.5, size=100)

    # Define models
    algorithms = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(n_estimators=10)]

    # Perform the diagnostic test
    results = residual_distribution_test(X_train, y_train, X_test, y_test, algorithms, alpha=0.05)

    # Print results
    print("Residual Distribution Test Results:")
    for model, result in results.items():
        print(f"{model}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Residual Distribution Test Results:
LinearRegression:
  - Statistical Test: t-Test
  - Test Statistic: -0.14433897980598043
  - P-Value: 0.8853854900983777
  - Significance: not significant
  - Interpretation: The t-Test indicates the residuals are not statistically different between train and test sets, suggesting consistent behavior across both sets.
DecisionTreeRegressor:
  - Statistical Test: t-Test
  - Test Statistic: 0.9467023040959144
  - P-Value: 0.3450543851725214
  - Significance: not significant
  - Interpretation: The t-Test indicates the residuals are not statistically different between train and test sets, suggesting consistent behavior across both sets.
RandomForestRegressor:
  - Statistical Test: t-Test
  - Test Statistic: 0.16677158459092223
  - P-Value: 0.8677295873972957
  - Significance: not significant
  - Interpretation: The t-Test indicates the residuals are not statistically different between train and test sets, suggesting consistent behavior across both sets.
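
To support the reporting described in step 6, the train and test residual distributions can also be compared visually. The sketch below is a minimal matplotlib example, assuming train_residuals and test_residuals are the residual arrays computed for a single algorithm (the function above does not return them, so you would expose or recompute them for plotting).

import matplotlib.pyplot as plt

def plot_residual_distributions(train_residuals, test_residuals, title="Residual Errors"):
    """Side-by-side boxplots of train and test residual errors."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.boxplot([train_residuals, test_residuals])
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Train", "Test"])
    ax.set_title(title)
    ax.set_ylabel("Residual error")
    plt.show()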