Residual Error Confusion Matrix
Procedure
This procedure evaluates whether the distributions of classification errors (confusion matrices) on the train and test sets are statistically consistent, providing insight into model performance and generalization.
Gather Classification Error Data
- What to do: Collect confusion matrices for the train set and test set for the chosen machine learning algorithms.
- Confusion matrices should summarize true positives, false positives, true negatives, and false negatives.
- Ensure that you have confusion matrices for all algorithms being compared.
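A minimal sketch of this step, assuming a scikit-learn-style classifier `model` (a hypothetical name) and arrays `X_train`, `y_train`, `X_test`, `y_test`; it mirrors the cross-validated approach used in the full code example below.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
# Out-of-fold predictions for each split, then one confusion matrix per split.
train_preds = cross_val_predict(model, X_train, y_train, cv=10)
test_preds = cross_val_predict(model, X_test, y_test, cv=10)
train_cm = confusion_matrix(y_train, train_preds)
test_cm = confusion_matrix(y_test, test_preds)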
Choose a Statistical Test
- What to do: Select the Chi-Squared Test to compare the distributions of classification errors.
- The Chi-Squared Test is appropriate for categorical data and evaluates whether observed differences between distributions are statistically significant.
- Ensure that both confusion matrices are comparable in terms of their structure and class labels.
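The comparability requirement above can be enforced in code. A minimal sketch, assuming `y_train`, `y_test`, and the predictions from the previous step: passing a shared label list to scikit-learn's `confusion_matrix` keeps both matrices on the same class layout.
import numpy as np
from sklearn.metrics import confusion_matrix
# Use one shared label ordering so both matrices have identical shape and cell meaning.
labels = np.unique(np.concatenate((y_train, y_test)))
train_cm = confusion_matrix(y_train, train_preds, labels=labels)
test_cm = confusion_matrix(y_test, test_preds, labels=labels)
assert train_cm.shape == test_cm.shape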
Prepare the Data for Testing
- What to do: Format the data for input into the Chi-Squared Test.
- Flatten each confusion matrix into a 1-dimensional array of counts (for a binary problem, scikit-learn's layout flattens to [TN, FP, FN, TP]).
- Combine these arrays into a contingency table for comparison.
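A minimal sketch of the data preparation, assuming `train_cm` and `test_cm` are confusion matrices of identical shape:
import numpy as np
# Flatten each matrix to a 1-D array of counts and stack them into a 2-row
# contingency table: row 0 = train error counts, row 1 = test error counts.
contingency_table = np.vstack((train_cm.flatten(), test_cm.flatten()))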
Perform the Chi-Squared Test
- What to do: Apply the Chi-Squared Test to evaluate the similarity between train and test error distributions.
- Input the contingency table into the test function.
- Ensure the counts are large enough for the test’s assumptions (expected counts of at least 5 per cell are a common rule of thumb).
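A minimal sketch of running the test with SciPy, assuming the `contingency_table` built in the previous step:
from scipy.stats import chi2_contingency
# Returns the test statistic, p-value, degrees of freedom, and the expected
# counts under the null hypothesis that both rows share the same distribution.
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# Rough assumption check: the approximation is usually considered reliable
# when every cell has an expected count of at least 5.
print("Cells with expected count below 5:", int((expected < 5).sum()))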
Evaluate Statistical Significance
- What to do: Analyze the p-value from the Chi-Squared Test.
- A p-value less than the chosen significance level (e.g., 0.05) indicates that the distributions of classification errors are significantly different.
- A p-value greater than the significance level suggests no significant difference between the distributions.
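A minimal sketch of the significance check, assuming the `p_value` computed above:
alpha = 0.05  # chosen significance level
if p_value < alpha:
    print(f"Train and test error distributions differ significantly (p = {p_value:.4f}).")
else:
    print(f"No significant difference between train and test error distributions (p = {p_value:.4f}).")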
Interpret the Results
- What to do: Derive insights based on the statistical test results.
- If the distributions are not significantly different, the model’s error profile is consistent across train and test sets, which supports good generalization.
- If distributions differ significantly, investigate potential causes such as overfitting, imbalanced data, or data leakage.
Report the Findings
- What to do: Summarize the statistical test results and their implications.
- Include the Chi-Squared Test statistic, p-value, and interpretation of the findings.
- Discuss potential causes for any significant differences and suggest next steps, such as adjusting model complexity or improving data preprocessing.
Code Example
This Python function evaluates whether the classification error distributions (confusion matrices) on the train and test sets are statistically similar using the Chi-Squared test.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from scipy.stats import chi2_contingency
def classification_error_distribution_test(
X_train, y_train, X_test, y_test, algorithms, k=10, alpha=0.05
):
"""
Test whether the classification error distributions (confusion matrices) on
the train and test sets are statistically similar using the Chi-Squared test.
Parameters:
X_train (np.ndarray): Train set features.
y_train (np.ndarray): Train set labels.
X_test (np.ndarray): Test set features.
y_test (np.ndarray): Test set labels.
algorithms (list): List of classification models to evaluate.
k (int): Number of folds for cross-validation (default=10).
alpha (float): P-value threshold for statistical significance (default=0.05).
Returns:
dict: Dictionary with test results and interpretation for each algorithm.
"""
results = {}
for alg in algorithms:
# Cross-validation predictions for train and test sets
train_preds = cross_val_predict(alg, X_train, y_train, cv=k)
test_preds = cross_val_predict(alg, X_test, y_test, cv=k)
# Compute confusion matrices
train_cm = confusion_matrix(y_train, train_preds)
test_cm = confusion_matrix(y_test, test_preds)
# Flatten confusion matrices and combine into a contingency table
contingency_table = np.vstack((train_cm.flatten(), test_cm.flatten()))
# Perform Chi-Squared test
chi2, p_value, _, _ = chi2_contingency(contingency_table)
# Interpret the results
significance = "significant" if p_value <= alpha else "not significant"
if significance == "not significant":
interpretation = (
"The classification error distributions are statistically similar, "
"indicating consistent model performance on train and test sets."
)
else:
interpretation = (
"The classification error distributions are significantly different, "
"suggesting potential overfitting, data leakage, or generalization issues."
)
# Store results
results[alg.__class__.__name__] = {
"Chi-Squared Statistic": chi2,
"P-Value": p_value,
"Significance": significance,
"Interpretation": interpretation,
}
return results
# Demo Usage
if __name__ == "__main__":
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Generate synthetic classification data
np.random.seed(42)
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=42)
X_test, y_test = make_classification(n_samples=200, n_features=10, random_state=24)
# Define models
algorithms = [
LogisticRegression(max_iter=1000),
DecisionTreeClassifier(),
RandomForestClassifier(n_estimators=10),
]
# Perform the diagnostic test
results = classification_error_distribution_test(
X_train, y_train, X_test, y_test, algorithms, alpha=0.05
)
# Print results
print("Classification Error Distribution Test Results:")
for model, result in results.items():
print(f"{model}:")
for key, value in result.items():
print(f" - {key}: {value}")
Example Output
Classification Error Distribution Test Results:
LogisticRegression:
- Chi-Squared Statistic: 2.237417351323617
- P-Value: 0.5246159717212884
- Significance: not significant
- Interpretation: The classification error distributions are statistically similar, indicating consistent model performance on train and test sets.
DecisionTreeClassifier:
- Chi-Squared Statistic: 3.1919863082653777
- P-Value: 0.3629612584811778
- Significance: not significant
- Interpretation: The classification error distributions are statistically similar, indicating consistent model performance on train and test sets.
RandomForestClassifier:
- Chi-Squared Statistic: 3.164846698376544
- P-Value: 0.36690070855912804
- Significance: not significant
- Interpretation: The classification error distributions are statistically similar, indicating consistent model performance on train and test sets.