Residual Error Distributions Correlated

Procedure

This procedure evaluates whether the residual errors from model predictions on the train and test sets are highly correlated, providing insights into model consistency and generalization.

  1. Gather Residual Error Samples

    • What to do: Ensure you have residual error samples for both the train and test sets for each machine learning algorithm you are evaluating.
      • Residual errors are calculated as the difference between actual values and predicted values for each data point.
      • These should already be split into train and test samples.
  2. Calculate Correlation

    • What to do: Compute the correlation between residual errors on the train and test sets.
      • Use Pearson’s correlation coefficient if you expect a linear relationship.
      • Use Spearman’s rank correlation coefficient if the relationship might be non-linear but monotonic.
      • Ensure that both residual error samples have the same length, since the correlation is computed over paired values (a minimal sketch follows this procedure).
  3. Interpret Correlation Coefficient

    • What to do: Analyze the correlation value to assess the relationship between train and test residuals.
      • A correlation coefficient close to +1 indicates a strong positive relationship, suggesting consistent residual patterns.
      • A coefficient close to 0 suggests no relationship, indicating potential inconsistencies in model behavior.
      • A negative coefficient implies an inverse relationship, which may be unexpected and requires further investigation.
  4. Evaluate Statistical Significance

    • What to do: Assess whether the correlation is statistically significant.
      • Perform a hypothesis test for the correlation coefficient to check if it significantly differs from 0.
      • Record the p-value. A p-value less than the chosen significance level (e.g., 0.05) indicates a significant correlation.
  5. Interpret the Results

    • What to do: Derive conclusions based on the correlation and its significance.
      • If the correlation is high and positive, it confirms consistency between train and test residuals, which is desirable.
      • If the correlation is low or non-significant, it suggests inconsistencies, indicating that the model may generalize poorly.
      • If the correlation is unexpectedly negative, it may suggest overfitting or issues in data partitioning or preprocessing.
  6. Report the Findings

    • What to do: Summarize your results and their implications.
      • Include the calculated correlation coefficient, p-value, and interpretation of the results.
      • Highlight any unexpected findings, such as low or negative correlations, and suggest potential next steps (e.g., model refinement or data investigation).
      • Use visualizations (e.g., scatter plots of residuals) to support your analysis; a plotting sketch follows below.
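
As a minimal sketch of steps 1, 2, and 4, the snippet below computes a correlation between paired train and test residual samples, choosing between Pearson's and Spearman's methods based on a normality check. The residual arrays here are synthetic stand-ins, and the 0.05 normality threshold is an illustrative assumption; in practice the residuals would come from your own models.

import numpy as np
from scipy.stats import pearsonr, spearmanr, shapiro

# Illustrative residuals (actual minus predicted) for paired samples (step 1).
# These arrays are assumed inputs; in practice they come from your models.
rng = np.random.default_rng(0)
train_residuals = rng.normal(size=50)
test_residuals = 0.8 * train_residuals + rng.normal(scale=0.3, size=50)

# Both samples must have the same length for pairwise correlation (step 2).
assert len(train_residuals) == len(test_residuals)

# Use Pearson if both samples look normal, otherwise Spearman (step 2).
if shapiro(train_residuals).pvalue > 0.05 and shapiro(test_residuals).pvalue > 0.05:
    correlation, p_value = pearsonr(train_residuals, test_residuals)
else:
    correlation, p_value = spearmanr(train_residuals, test_residuals)

# SciPy returns the two-sided p-value for H0: correlation == 0 (step 4).
print(f"correlation={correlation:.3f}, p-value={p_value:.4f}")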
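
For step 6, a scatter plot of train residuals against test residuals makes the strength and direction of the relationship visible at a glance. This sketch assumes matplotlib is available and reuses the residual arrays from the snippet above.

import matplotlib.pyplot as plt

# Scatter plot of paired residuals; a tight diagonal cloud indicates
# a strong positive correlation between train and test errors.
plt.scatter(train_residuals, test_residuals, alpha=0.7)
plt.xlabel("Train residuals")
plt.ylabel("Test residuals")
plt.title("Train vs. test residual errors")
plt.show()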

Code Example

This Python function evaluates whether the residual errors from model predictions on the train and test sets are highly correlated. It first checks whether each residual sample is normally distributed using the Shapiro-Wilk test, then applies Pearson's correlation if both samples pass and Spearman's rank correlation otherwise.

import numpy as np
from scipy.stats import pearsonr, spearmanr, shapiro
from sklearn.model_selection import cross_val_predict

def residual_correlation_test(
    X_train, y_train, X_test, y_test, algorithms, k=10, alpha=0.05
):
    """
    Test whether the residual errors on the train and test sets are highly correlated.

    Parameters:
        X_train (np.ndarray): Train set features.
        y_train (np.ndarray): Train set target.
        X_test (np.ndarray): Test set features.
        y_test (np.ndarray): Test set target.
        algorithms (list): List of machine learning models to evaluate.
        k (int): Number of folds for cross-validation (default=10).
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with correlation results and interpretation for each algorithm.
    """
    results = {}

    for alg in algorithms:
        # Cross-validation predictions on the train set
        train_preds = cross_val_predict(alg, X_train, y_train, cv=k)
        train_residuals = y_train - train_preds

        # Cross-validation predictions on the test set
        test_preds = cross_val_predict(alg, X_test, y_test, cv=k)
        test_residuals = y_test - test_preds

        # Pairwise correlation requires samples of equal length
        if len(train_residuals) != len(test_residuals):
            raise ValueError("Train and test residual samples must be the same length.")

        # Check normality of residuals with the Shapiro-Wilk test
        _, p_train_normal = shapiro(train_residuals)
        _, p_test_normal = shapiro(test_residuals)

        # Choose appropriate correlation method
        if p_train_normal > alpha and p_test_normal > alpha:
            method = "Pearson"
            correlation, p_value = pearsonr(train_residuals, test_residuals)
        else:
            method = "Spearman"
            correlation, p_value = spearmanr(train_residuals, test_residuals)

        # Flag whether the correlation is statistically significant
        significance = "significant" if p_value <= alpha else "not significant"

        # Interpret the strength and direction of the correlation coefficient
        if correlation > 0.8:
            interpretation = "The residuals have a high positive correlation, indicating strong consistency and good generalization."
        elif 0.5 < correlation <= 0.8:
            interpretation = "The residuals have a moderate positive correlation, indicating reasonable consistency but potential minor generalization issues."
        elif 0.2 < correlation <= 0.5:
            interpretation = "The residuals have a low positive correlation, suggesting weak consistency and potential generalization concerns."
        elif -0.2 <= correlation <= 0.2:
            interpretation = "The residuals have little to no correlation, indicating poor consistency and likely generalization problems."
        else:
            interpretation = "The residuals have a negative correlation, suggesting inverse relationships and severe generalization issues."

        # Store results
        results[alg.__class__.__name__] = {
            "Correlation Method": method,
            "Correlation": correlation,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation,
        }

    return results

# Demo Usage
if __name__ == "__main__":
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor

    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    y_train = X_train[:, 0] * 2 + np.random.normal(loc=0, scale=0.5, size=100)
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))
    y_test = X_test[:, 0] * 2 + np.random.normal(loc=0, scale=0.5, size=100)

    # Define models
    algorithms = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(n_estimators=10)]

    # Perform the diagnostic test
    results = residual_correlation_test(X_train, y_train, X_test, y_test, algorithms, alpha=0.05)

    # Print results
    print("Residual Correlation Test Results:")
    for model, result in results.items():
        print(f"{model}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Residual Correlation Test Results:
LinearRegression:
  - Correlation Method: Pearson
  - Correlation: -0.13994348478847515
  - P-Value: 0.1649287316055379
  - Significance: not significant
  - Interpretation: The residuals have little to no correlation, indicating poor consistency and likely generalization problems.
DecisionTreeRegressor:
  - Correlation Method: Pearson
  - Correlation: -0.09194504725426
  - P-Value: 0.3629175215329332
  - Significance: not significant
  - Interpretation: The residuals have little to no correlation, indicating poor consistency and likely generalization problems.
RandomForestRegressor:
  - Correlation Method: Pearson
  - Correlation: -0.22060791982493722
  - P-Value: 0.02741252648616371
  - Significance: significant
  - Interpretation: The residuals have a negative correlation, suggesting inverse relationships and severe generalization issues.