Residual Distributions

We assume that the distribution of residual prediction errors is representative of residuals in the broader domain.

As such, residual errors on the train set should have the same distribution as residual errors on the test set.

Is there evidence that model residual errors on the train set and test set have the same distributions?

Prerequisite

We require a single test harness that we can use to evaluate a suite of standard models on the train set and the test set that results in a modest sample of prediction errors.

  1. Choose a diverse suite (10+) of standard machine learning algorithms with good default hyperparameters that are appropriate for your predictive modeling task (e.g. DummyModel, SVM, KNN, DecisionTree, Linear, NeuralNet, etc.)
  2. Choose a standard test harness:
    • Did you use a train/test split for model selection in your test harness?
      • Perform multiple (e.g. 5 or 10) train/test splits on the train set and the test set and gather the two samples of performance scores.
    • Did you use k-fold cross-validation for model selection (e.g. kfold, stratified kfold, repeated kfold, etc.) in your test harness?
      • Apply the same cross-validation evaluation procedure to the train set and the test set and gather the two samples of performance scores.
    • Was performing multiple fits of the chosen model too expensive, so you used a single train/val split on the train set?
      • Evaluate the same chosen model on many (e.g. 30, 100, 1000) bootstrap samples of the validation set and the test set and gather the two samples of performance scores.
  3. Evaluate the suite of algorithms on the train set and test set using the same test harness.
  4. Gather the sample of prediction errors on the hold-out data for the train set and the test set for each algorithm.
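
The steps above can be sketched as follows for the k-fold cross-validation case. This is a minimal illustration rather than a definitive harness: it assumes a regression task, uses scikit-learn, substitutes synthetic data from make_regression for a real dataset, and evaluates only four algorithms for brevity. The helper name holdout_residuals and the residuals dictionary are hypothetical names introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic data stands in for your real train and test sets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# A small suite for brevity; the checklist suggests 10 or more algorithms.
models = {
    "dummy": DummyRegressor(),
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "tree": DecisionTreeRegressor(random_state=1),
}


def holdout_residuals(model, X, y, n_splits=5, seed=1):
    """Return out-of-fold residuals (y - prediction) from k-fold cross-validation."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each row is predicted by a model that did not see it during fitting.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    return y - y_pred


# Apply the same procedure to the train set and the test set for each algorithm.
residuals = {
    name: {
        "train": holdout_residuals(model, X_train, y_train),
        "test": holdout_residuals(model, X_test, y_test),
    }
    for name, model in models.items()
}

for name, res in residuals.items():
    print(name, res["train"].shape, res["test"].shape)
```

Using out-of-fold predictions means every residual comes from hold-out data, which keeps the train-set and test-set samples comparable. For a classification task the same structure applies, with classifiers and a suitable error measure swapped in.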

Checks

We now have a sample of prediction errors on the train set and another for the test set for each algorithm.

Next, compare the distributions of these errors (all together, or per algorithm, or both).

  1. Are residual errors on train and test sets highly correlated (e.g. Pearson’s or Spearman’s)?
  2. Do the residual errors on train and test sets have the same central tendencies (e.g. t-Test or Mann-Whitney U Test)?
  3. Do the residual errors on train and test sets have the same distributions (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test)?
  4. Do the residual errors on train and test sets have the same variance (e.g. F-test, Levene’s test)?
  5. Is the effect size of the difference in residual errors between the train and test sets small (e.g. Cohen’s d)?
  6. For classification only: do the confusion matrices of predictions from the train and test sets have the same distributions (e.g. Chi-Squared Test)?
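
Several of the checks above can be run with SciPy. The sketch below covers checks 2 to 5 only (central tendency, distribution, variance, effect size) for two unpaired samples of residuals. It assumes the residuals are held in NumPy arrays; the random samples are placeholders, and compare_residuals and cohens_d are hypothetical helper names. Cohen's d is computed by hand with the pooled standard deviation. The correlation check is left out because Pearson's and Spearman's coefficients need paired values of equal length, and the confusion matrix check is left out because it applies only to classification.

```python
import numpy as np
from scipy import stats

# Assumption: placeholder residuals; in practice use the hold-out errors
# gathered by the test harness for the train set and the test set.
rng = np.random.default_rng(1)
train_res = rng.normal(loc=0.0, scale=1.0, size=350)
test_res = rng.normal(loc=0.0, scale=1.0, size=150)


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)
    ) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


def compare_residuals(a, b):
    """Return p-values for several two-sample tests, plus the effect size."""
    return {
        # Central tendency: Welch's t-test and Mann-Whitney U test.
        "welch_t_p": stats.ttest_ind(a, b, equal_var=False).pvalue,
        "mann_whitney_p": stats.mannwhitneyu(a, b, alternative="two-sided").pvalue,
        # Whole distribution: two-sample Kolmogorov-Smirnov test.
        "ks_p": stats.ks_2samp(a, b).pvalue,
        # Variance: Levene's test.
        "levene_p": stats.levene(a, b).pvalue,
        # Effect size of the difference in means.
        "cohens_d": cohens_d(a, b),
    }


report = compare_residuals(train_res, test_res)
for check, value in report.items():
    print(f"{check}: {value:.4f}")
```

A small p-value (for example, below 0.05) in any test is evidence that the train and test residuals differ in that respect. When running many tests across many algorithms, some small p-values are expected by chance, so it helps to read them alongside the effect size rather than in isolation.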