Residual Error Distributions
Procedure
This procedure evaluates whether the residual errors from model predictions on the train and test sets follow the same distribution, providing insights into model generalization and consistency.
Gather Residual Error Samples
- What to do: Collect residual error samples for the train and test sets for your chosen machine learning algorithms.
- Residual errors are calculated as the difference between actual values and predicted values for each data point.
- Ensure residuals are prepared for all algorithms under evaluation using the same test harness (e.g., 10-fold cross-validation), as in the sketch below.
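The following is a minimal sketch of this step, assuming scikit-learn is available and using a small synthetic regression problem in place of real data; the variable names (X_train, y_train, X_test, y_test) simply mirror the full example later in this section.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for a real train/test split
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 5))
y_train = X_train[:, 0] * 2 + rng.normal(scale=0.5, size=100)
X_test = rng.normal(size=(100, 5))
y_test = X_test[:, 0] * 2 + rng.normal(scale=0.5, size=100)

model = LinearRegression()

# Out-of-fold predictions from the same 10-fold harness on each split
train_preds = cross_val_predict(model, X_train, y_train, cv=10)
test_preds = cross_val_predict(model, X_test, y_test, cv=10)

# Residuals: actual value minus predicted value, one per data point
train_residuals = y_train - train_preds
test_residuals = y_test - test_preds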
Select a Statistical Test for Distribution Comparison
- What to do: Choose an appropriate statistical test to compare the distributions of train and test residuals.
- Use the Kolmogorov-Smirnov Test to assess whether two samples come from the same distribution.
- Alternatively, use the Anderson-Darling Test for a more sensitive comparison, especially if small deviations in the distribution tails are a concern; both tests are sketched below.
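As a minimal sketch of the two candidate tests, the snippet below runs both on synthetic residual samples using SciPy's ks_2samp and anderson_ksamp, the same functions used in the full example below.
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

# Synthetic residual samples standing in for real train/test residuals
rng = np.random.default_rng(0)
train_residuals = rng.normal(loc=0.0, scale=1.0, size=200)
test_residuals = rng.normal(loc=0.0, scale=1.1, size=200)

# Kolmogorov-Smirnov: largest gap between the two empirical CDFs
ks_stat, ks_p_value = ks_2samp(train_residuals, test_residuals)
print(f"KS: statistic={ks_stat:.3f}, p-value={ks_p_value:.3f}")

# Anderson-Darling k-sample: places more weight on the distribution tails.
# Note that SciPy caps the reported approximate p-value to the range [0.001, 0.25].
ad_result = anderson_ksamp([train_residuals, test_residuals])
print(f"AD: statistic={ad_result.statistic:.3f}, approx. p-value={ad_result.significance_level:.3f}")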
Perform the Statistical Test
- What to do: Apply the selected test to compare train and test residuals for each algorithm.
- Compute the test statistic and corresponding p-value.
- Repeat this process for each algorithm in your suite to evaluate consistency across models, as in the sketch below.
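To make this step concrete, here is a minimal sketch that loops over a small suite of models and applies the Kolmogorov-Smirnov Test to each; the models and synthetic data are illustrative stand-ins.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict

# Illustrative synthetic data
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 5))
y_train = X_train[:, 0] * 2 + rng.normal(scale=0.5, size=100)
X_test = rng.normal(size=(100, 5))
y_test = X_test[:, 0] * 2 + rng.normal(scale=0.5, size=100)

for model in [LinearRegression(), DecisionTreeRegressor(random_state=1)]:
    # Residuals from the same 10-fold harness on each split
    train_residuals = y_train - cross_val_predict(model, X_train, y_train, cv=10)
    test_residuals = y_test - cross_val_predict(model, X_test, y_test, cv=10)
    # Test statistic and p-value for this algorithm
    stat, p_value = ks_2samp(train_residuals, test_residuals)
    print(f"{model.__class__.__name__}: statistic={stat:.3f}, p-value={p_value:.3f}")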
Interpret the Test Results
- What to do: Evaluate whether the distributions of residuals are statistically similar.
- A p-value greater than the chosen significance level (e.g., 0.05) means the test fails to reject the null hypothesis that the two residual samples come from the same distribution.
- A p-value less than or equal to the significance level suggests the distributions are significantly different, indicating potential overfitting or data inconsistencies; the sketch below shows this decision rule.
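The decision rule itself is a simple threshold comparison; the sketch below uses hypothetical p-values purely to illustrate the interpretation logic.
alpha = 0.05  # chosen significance level

# Hypothetical p-values standing in for real test output
p_values = {"LinearRegression": 0.28, "DecisionTreeRegressor": 0.01}

for name, p_value in p_values.items():
    if p_value > alpha:
        print(f"{name}: no significant difference between train and test residual distributions")
    else:
        print(f"{name}: residual distributions differ significantly (possible overfitting or data inconsistency)")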
Validate Assumptions
- What to do: Confirm the assumptions of your chosen statistical test are satisfied.
- Verify that the residual samples are independent and appropriately scaled for the test.
- Consider using transformations if necessary to meet test assumptions, such as standardization, or a log transform when the values are strictly positive; see the sketch below.
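As one example of a scale adjustment, the sketch below standardizes each residual sample before testing. Note that standardizing both samples separately removes differences in location and spread, so this is only appropriate when you deliberately want to compare distribution shape alone.
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical residual samples measured on very different scales
rng = np.random.default_rng(7)
train_residuals = rng.normal(scale=1.0, size=200)
test_residuals = rng.normal(scale=5.0, size=200)

def standardize(x):
    # Zero-mean, unit-variance rescaling so only distribution shape is compared
    return (x - x.mean()) / x.std()

stat, p_value = ks_2samp(standardize(train_residuals), standardize(test_residuals))
print(f"KS on standardized residuals: statistic={stat:.3f}, p-value={p_value:.3f}")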
Report the Findings
- What to do: Summarize the results of your distribution comparison.
- Report the test statistic, p-value, and conclusion for each algorithm.
- Highlight any algorithms where train and test residuals significantly differ, as this may indicate overfitting or poor model generalization.
- Use visual aids (e.g., histograms or cumulative distribution plots, as sketched below) to support your findings.
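A minimal plotting sketch, assuming matplotlib is available, that produces the two visual aids mentioned above for a pair of synthetic residual samples:
import numpy as np
import matplotlib.pyplot as plt

# Synthetic residual samples standing in for real train/test residuals
rng = np.random.default_rng(3)
train_residuals = rng.normal(loc=0.0, scale=1.0, size=200)
test_residuals = rng.normal(loc=0.2, scale=1.3, size=200)

fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid histograms of the two residual samples
ax_hist.hist(train_residuals, bins=30, alpha=0.5, label="train")
ax_hist.hist(test_residuals, bins=30, alpha=0.5, label="test")
ax_hist.set_title("Residual histograms")
ax_hist.legend()

# Empirical cumulative distribution functions
for sample, label in [(train_residuals, "train"), (test_residuals, "test")]:
    x = np.sort(sample)
    y = np.arange(1, len(x) + 1) / len(x)
    ax_cdf.step(x, y, where="post", label=label)
ax_cdf.set_title("Empirical CDFs")
ax_cdf.legend()

plt.tight_layout()
plt.show()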
Recommend Next Steps
- What to do: Provide actionable recommendations based on the findings.
- For models with significant distribution differences, consider revisiting data preprocessing, feature engineering, or hyperparameter tuning.
- Investigate potential causes such as data leakage, imbalanced classes, or insufficient data splitting strategies.
Code Example
This Python function evaluates whether the residual errors from model predictions on the train and test sets follow the same distribution, using the Kolmogorov-Smirnov Test with the Anderson-Darling k-sample Test as a secondary check.
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp
from sklearn.model_selection import cross_val_predict


def residual_distribution_test(X_train, y_train, X_test, y_test, algorithms, k=10, alpha=0.05):
    """
    Test whether the residual errors on the train and test sets have the same distribution.

    Parameters:
        X_train (np.ndarray): Train set features.
        y_train (np.ndarray): Train set target.
        X_test (np.ndarray): Test set features.
        y_test (np.ndarray): Test set target.
        algorithms (list): List of machine learning models to evaluate.
        k (int): Number of folds for cross-validation (default=10).
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with results and interpretation for each algorithm.
    """
    results = {}
    for alg in algorithms:
        # Out-of-fold cross-validation predictions and residuals on the train set
        train_preds = cross_val_predict(alg, X_train, y_train, cv=k)
        train_residuals = y_train - train_preds

        # Out-of-fold cross-validation predictions and residuals on the test set
        test_preds = cross_val_predict(alg, X_test, y_test, cv=k)
        test_residuals = y_test - test_preds

        # Apply Kolmogorov-Smirnov Test
        ks_stat, ks_p_value = ks_2samp(train_residuals, test_residuals)

        # Apply Anderson-Darling Test (alternative, more sensitive in the tails)
        try:
            ad_stat, _, ad_p_value = anderson_ksamp([train_residuals, test_residuals])
        except ValueError:
            ad_stat, ad_p_value = np.nan, np.nan

        # Interpret the results against the significance threshold
        ks_significance = "significant" if ks_p_value <= alpha else "not significant"
        ad_significance_text = (
            "significant"
            if not np.isnan(ad_p_value) and ad_p_value <= alpha
            else "not significant"
        )

        # Store results keyed by algorithm class name
        results[alg.__class__.__name__] = {
            "KS Statistic": ks_stat,
            "KS P-Value": ks_p_value,
            "KS Significance": ks_significance,
            "AD Statistic": ad_stat,
            "AD Significance": ad_significance_text,
            "Interpretation": (
                "Residual distributions are similar between train and test sets"
                if ks_significance == "not significant"
                else "Residual distributions differ significantly between train and test sets"
            ),
        }
    return results
# Demo Usage
if __name__ == "__main__":
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor

    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    y_train = X_train[:, 0] * 2 + np.random.normal(loc=0, scale=0.5, size=100)
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))
    y_test = X_test[:, 0] * 2 + np.random.normal(loc=0, scale=0.5, size=100)

    # Define models
    algorithms = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(n_estimators=10)]

    # Perform the diagnostic test
    results = residual_distribution_test(X_train, y_train, X_test, y_test, algorithms, alpha=0.05)

    # Print results
    print("Residual Distribution Test Results:")
    for model, result in results.items():
        print(f"{model}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Residual Distribution Test Results:
LinearRegression:
- KS Statistic: 0.14
- KS P-Value: 0.2819416298082479
- KS Significance: not significant
- AD Statistic: 0.29117366712216736
- AD Significance: not significant
- Interpretation: Residual distributions are similar between train and test sets
DecisionTreeRegressor:
- KS Statistic: 0.13
- KS P-Value: 0.36818778606286096
- KS Significance: not significant
- AD Statistic: 0.5964780019496508
- AD Significance: not significant
- Interpretation: Residual distributions are similar between train and test sets
RandomForestRegressor:
- KS Statistic: 0.11
- KS P-Value: 0.5830090612540064
- KS Significance: not significant
- AD Statistic: 0.11362768335821165
- AD Significance: not significant
- Interpretation: Residual distributions are similar between train and test sets