Residual Error Confusion Matrix
Procedure
This procedure evaluates whether the distributions of classification errors (confusion matrices) on the train and test sets are statistically consistent, providing insight into model performance and generalization.
Gather Classification Error Data
- What to do: Collect confusion matrices for the train set and test set for the chosen machine learning algorithms.
- Confusion matrices should summarize true positives, false positives, true negatives, and false negatives.
- Ensure that you have confusion matrices for all algorithms being compared.
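A minimal sketch of this step, assuming a scikit-learn-style classifier `model` (a hypothetical name) and arrays `X_train`, `y_train`, `X_test`, `y_test`; it mirrors the cross-validated approach used in the full code example below.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
# Out-of-fold predictions for each split, then one confusion matrix per split.
train_preds = cross_val_predict(model, X_train, y_train, cv=10)
test_preds = cross_val_predict(model, X_test, y_test, cv=10)
train_cm = confusion_matrix(y_train, train_preds)
test_cm = confusion_matrix(y_test, test_preds)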
Choose a Statistical Test
- What to do: Select the Chi-Squared Test to compare the distributions of classification errors.
- The Chi-Squared Test is appropriate for categorical data and evaluates whether observed differences between distributions are statistically significant.
- Ensure that both confusion matrices are comparable in terms of their structure and class labels.
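The comparability requirement above can be enforced in code. A minimal sketch, assuming `y_train`, `y_test`, and the predictions from the previous step: passing a shared label list to scikit-learn's `confusion_matrix` keeps both matrices on the same class layout.
import numpy as np
from sklearn.metrics import confusion_matrix
# Use one shared label ordering so both matrices have identical shape and cell meaning.
labels = np.unique(np.concatenate((y_train, y_test)))
train_cm = confusion_matrix(y_train, train_preds, labels=labels)
test_cm = confusion_matrix(y_test, test_preds, labels=labels)
assert train_cm.shape == test_cm.shape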
Prepare the Data for Testing
- What to do: Format the data for input into the Chi-Squared Test.
- Flatten each confusion matrix into a 1-dimensional array of counts (for a binary problem, scikit-learn's layout flattens to [TN, FP, FN, TP]).
- Combine these arrays into a contingency table for comparison.
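A minimal sketch of the data preparation, assuming `train_cm` and `test_cm` are confusion matrices of identical shape:
import numpy as np
# Flatten each matrix to a 1-D array of counts and stack them into a 2-row
# contingency table: row 0 = train error counts, row 1 = test error counts.
contingency_table = np.vstack((train_cm.flatten(), test_cm.flatten()))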
Perform the Chi-Squared Test
- What to do: Apply the Chi-Squared Test to evaluate the similarity between train and test error distributions.
- Input the contingency table into the test function.
- Ensure the counts are large enough for the test’s assumptions (expected counts of at least 5 per cell are a common rule of thumb).
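A minimal sketch of running the test with SciPy, assuming the `contingency_table` built in the previous step:
from scipy.stats import chi2_contingency
# Returns the test statistic, p-value, degrees of freedom, and the expected
# counts under the null hypothesis that both rows share the same distribution.
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# Rough assumption check: the approximation is usually considered reliable
# when every cell has an expected count of at least 5.
print("Cells with expected count below 5:", int((expected < 5).sum()))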
Evaluate Statistical Significance
- What to do: Analyze the p-value from the Chi-Squared Test.
- A p-value less than the chosen significance level (e.g., 0.05) indicates that the distributions of classification errors are significantly different.
- A p-value greater than the significance level suggests no significant difference between the distributions.
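A minimal sketch of the significance check, assuming the `p_value` computed above:
alpha = 0.05  # chosen significance level
if p_value < alpha:
    print(f"Train and test error distributions differ significantly (p = {p_value:.4f}).")
else:
    print(f"No significant difference between train and test error distributions (p = {p_value:.4f}).")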
Interpret the Results
- What to do: Derive insights based on the statistical test results.
- If the distributions are not significantly different, the model’s error profile is consistent across train and test sets, which supports good generalization.
- If distributions differ significantly, investigate potential causes such as overfitting, imbalanced data, or data leakage.
Report the Findings
- What to do: Summarize the statistical test results and their implications.
- Include the Chi-Squared Test statistic, p-value, and interpretation of the findings.
- Discuss potential causes for any significant differences and suggest next steps, such as adjusting model complexity or improving data preprocessing.
Code Example
This Python function evaluates whether the classification error distributions (confusion matrices) on the train and test sets are statistically similar using the Chi-Squared test.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from scipy.stats import chi2_contingency
def classification_error_distribution_test(
X_train, y_train, X_test, y_test, algorithms, k=10, alpha=0.05
):
"""
Test whether the classification error distributions (confusion matrices) on
the train and test sets are statistically similar using the Chi-Squared test.
Parameters:
X_train (np.ndarray): Train set features.
y_train (np.ndarray): Train set labels.
X_test (np.ndarray): Test set features.
y_test (np.ndarray): Test set labels.
algorithms (list): List of classification models to evaluate.
k (int): Number of folds for cross-validation (default=10).
alpha (float): P-value threshold for statistical significance (default=0.05).
Returns:
dict: Dictionary with test results and interpretation for each algorithm.
"""
results = {}
for alg in algorithms:
# Cross-validation predictions for train and test sets
train_preds = cross_val_predict(alg, X_train, y_train, cv=k)
test_preds = cross_val_predict(alg, X_test, y_test, cv=k)
# Compute confusion matrices
train_cm = confusion_matrix(y_train, train_preds)
test_cm = confusion_matrix(y_test, test_preds)
# Flatten confusion matrices and combine into a contingency table
contingency_table = np.vstack((train_cm.flatten(), test_cm.flatten()))
# Perform Chi-Squared test
chi2, p_value, _, _ = chi2_contingency(contingency_table)
# Interpret the results
significance = "significant" if p_value <= alpha else "not significant"
if significance == "not significant":
interpretation = (
"The classification error distributions are statistically similar, "
"indicating consistent model performance on train and test sets."
)
else:
interpretation = (
"The classification error distributions are significantly different, "
"suggesting potential overfitting, data leakage, or generalization issues."
)
# Store results
results[alg.__class__.__name__] = {
"Chi-Squared Statistic": chi2,
"P-Value": p_value,
"Significance": significance,
"Interpretation": interpretation,
}
return results
# Demo Usage
if __name__ == "__main__":
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Generate synthetic classification data
np.random.seed(42)
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=42)
X_test, y_test = make_classification(n_samples=200, n_features=10, random_state=24)
# Define models
algorithms = [
LogisticRegression(max_iter=1000),
DecisionTreeClassifier(),
RandomForestClassifier(n_estimators=10),
]
# Perform the diagnostic test
results = classification_error_distribution_test(
X_train, y_train, X_test, y_test, algorithms, alpha=0.05
)
# Print results
print("Classification Error Distribution Test Results:")
for model, result in results.items():
print(f"{model}:")
for key, value in result.items():
print(f" - {key}: {value}")
Example Output
Classification Error Distribution Test Results:
LogisticRegression:
- Chi-Squared Statistic: 2.237417351323617
- P-Value: 0.5246159717212884
- Significance: not significant
- Interpretation: The classification error distributions are statistically similar, indicating consistent model performance on train and test sets.
DecisionTreeClassifier:
- Chi-Squared Statistic: 3.1919863082653777
- P-Value: 0.3629612584811778
- Significance: not significant
- Interpretation: The classification error distributions are statistically similar, indicating consistent model performance on train and test sets.
RandomForestClassifier:
- Chi-Squared Statistic: 3.164846698376544
- P-Value: 0.36690070855912804
- Significance: not significant
- Interpretation: The classification error distributions are statistically similar, indicating consistent model performance on train and test sets.