Model Performance
We assume that model performance on the train set and the test set generalizes to new data. As such, performance scores estimated on the train set and the test set should have the same distribution.
Is there evidence that model performance scores on the train and test sets have the same distribution?
Prerequisite
We require a single test harness that evaluates a suite of standard models on the train set and on the test set, producing multiple (>3) estimates of model performance on hold-out data.
- Choose a diverse suite (10+) of standard machine learning algorithms with good default hyperparameters that are appropriate for your predictive modeling task (e.g. DummyModel, SVM, KNN, DecisionTree, Linear, NeuralNet, etc.)
- Choose a standard test harness:
- Did you use a train/test split for model selection on the test harness?
- Perform multiple (e.g. 5 or 10) train/test splits on the train set and the test set and gather the two samples of performance scores.
- Did you use k-fold cross-validation for model selection (e.g. kfold, stratified kfold, repeated kfold, etc.) in your test harness?
- Apply the same cross-validation evaluation procedure to the train set and the test set and gather the two samples of performance scores.
- Is fitting the chosen model multiple times too expensive, so you used a single train/validation split on the train set?
- Evaluate the same chosen model on many (e.g. 30, 100, 1000) bootstrap samples of the validation set and the test set and gather the two samples of performance scores (see the bootstrap sketch after this list).
- Evaluate the suite of algorithms on the train set and test set using the same test harness.
- Gather the sample of hold-out performance scores on the train set and the test set for each algorithm (a minimal harness sketch follows this list).
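To make the harness concrete, below is a minimal sketch using scikit-learn, assuming a tabular classification task evaluated with repeated stratified k-fold cross-validation; the synthetic dataset, the particular algorithms, and the accuracy metric are placeholder assumptions to be swapped for your own data, suite, and metric.

```python
# A minimal harness sketch (placeholder data, models, and metric).
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic data stands in for your dataset and its existing train/test split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# A suite of standard algorithms with default hyperparameters.
models = {
    'dummy': DummyClassifier(strategy='most_frequent'),
    'linear': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'tree': DecisionTreeClassifier(random_state=1),
    'svm': SVC(),
    'mlp': MLPClassifier(max_iter=500, random_state=1),
}

# Apply the same repeated k-fold procedure to the train set and the test set,
# yielding one sample of hold-out scores per algorithm per dataset.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
train_scores, test_scores = {}, {}
for name, model in models.items():
    train_scores[name] = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    test_scores[name] = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv, n_jobs=-1)
    print(f'{name}: train {train_scores[name].mean():.3f}, test {test_scores[name].mean():.3f}')
```

Each algorithm ends up with 30 hold-out scores per dataset (10 folds times 3 repeats), comfortably more than the 3+ estimates required above.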
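If refitting the chosen model is too expensive, the bootstrap variant might look like the sketch below; the GradientBoostingClassifier, the synthetic data, and the split sizes are placeholder assumptions, and the key idea is that one fitted model is scored on many bootstrap samples of the validation set and of the test set.

```python
# A minimal bootstrap sketch for a single expensive model (placeholder data and model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)

# Fit the chosen model once on the training portion only.
model = GradientBoostingClassifier(random_state=1).fit(X_fit, y_fit)

def bootstrap_scores(model, X, y, n_boot=100, seed=1):
    """Score the already-fitted model on n_boot bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))  # sample rows with replacement
        scores.append(accuracy_score(y[idx], model.predict(X[idx])))
    return np.array(scores)

# Two samples of performance scores: validation set vs test set.
val_scores = bootstrap_scores(model, X_val, y_val)
test_scores = bootstrap_scores(model, X_test, y_test)
print(f'val  {val_scores.mean():.3f} +/- {val_scores.std():.3f}')
print(f'test {test_scores.mean():.3f} +/- {test_scores.std():.3f}')
```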
Checks
We now have a sample of performance scores on the train set and another for the test set for each algorithm.
Next, compare the distributions of these performance scores (pooled across all algorithms, per algorithm, or both); a sketch of these checks follows the list below.
- Are the train and test performance scores highly correlated (e.g. Pearson’s or Spearman’s correlation)?
- Do the performance scores on train and test sets have the same central tendencies (e.g. t-Test or Mann-Whitney U Test)?
- Do the performance scores on train and test sets have the same distributions (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test)?
- Do the performance scores on train and test sets have the same variance (e.g. F-test, Levene’s test)?
- Do the performance scores on train and test sets have a low divergence in distributions (e.g. KL-divergence or JS-divergence)?
- Is the effect size of the difference in performance between the train and test scores small (e.g. Cohen’s d or Cliff’s delta)?
- Do the paired (train, test) performance scores fall along the expected diagonal line on a scatter plot?
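Below is a minimal sketch of these checks using SciPy, assuming `train_scores` and `test_scores` are two 1-D arrays of hold-out scores gathered from the same harness (placeholder arrays are generated here); the correlation check is illustrated separately on fabricated per-algorithm mean scores.

```python
# Placeholder score samples; in practice use the arrays gathered from the harness.
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
train_scores = rng.normal(0.85, 0.02, 30)
test_scores = rng.normal(0.85, 0.02, 30)

# Central tendency: do the two samples share a mean/median?
print('t-test            ', stats.ttest_ind(train_scores, test_scores))
print('Mann-Whitney U    ', stats.mannwhitneyu(train_scores, test_scores))

# Full distribution: could the samples come from the same distribution?
print('Kolmogorov-Smirnov', stats.ks_2samp(train_scores, test_scores))
print('Anderson-Darling  ', stats.anderson_ksamp([train_scores, test_scores]))

# Spread: do the samples have the same variance?
print("Levene's test     ", stats.levene(train_scores, test_scores))

# Divergence between histograms of the two samples (JS distance, 0 = identical).
bins = np.histogram_bin_edges(np.concatenate([train_scores, test_scores]), bins=10)
p, _ = np.histogram(train_scores, bins=bins, density=True)
q, _ = np.histogram(test_scores, bins=bins, density=True)
print('JS distance       ', jensenshannon(p, q))

# Effect size: Cohen's d for the difference in means.
pooled_std = np.sqrt((train_scores.var(ddof=1) + test_scores.var(ddof=1)) / 2.0)
print("Cohen's d         ", (train_scores.mean() - test_scores.mean()) / pooled_std)

# Correlation is best checked across algorithms: pair each algorithm's mean train
# score with its mean test score (fabricated placeholder means here).
mean_train = rng.normal(0.80, 0.05, 10)
mean_test = mean_train + rng.normal(0.0, 0.01, 10)
print("Pearson's r       ", stats.pearsonr(mean_train, mean_test))
```

Large p-values and small effect sizes are consistent with the two samples of scores coming from the same distribution; small p-values or large divergences are evidence that they do not.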
Extensions
The above checks compare the distribution of performance scores on the train set with that on the test set:
- Train Set vs Test Set
After that, consider repeating all tests for the combinations:
- Train Set vs Whole Dataset (Train Set + Test Set)
- Test Set vs Whole Dataset (Train Set + Test Set)
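As a rough sketch of these extended comparisons, the same distribution check can simply be repeated for each pairing; the dataset, the single placeholder algorithm, and the Kolmogorov-Smirnov test below are illustrative assumptions.

```python
# Repeat one of the checks across the train/test/whole-dataset pairings.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_all, y_all = np.vstack([X_train, X_test]), np.concatenate([y_train, y_test])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = LogisticRegression(max_iter=1000)
scores = {
    'train': cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv),
    'test': cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv),
    'whole': cross_val_score(model, X_all, y_all, scoring='accuracy', cv=cv),
}

# The same distribution check (here Kolmogorov-Smirnov) for each pairing of interest.
for a, b in [('train', 'test'), ('train', 'whole'), ('test', 'whole')]:
    res = stats.ks_2samp(scores[a], scores[b])
    print(f'{a} vs {b}: KS statistic={res.statistic:.3f}, p-value={res.pvalue:.3f}')
```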
In some domains, you may be able to collect an additional large sample of data from the domain against which the performance score distributions for the train set and the test set can be compared:
- Train Set vs New Large Sample
- Test Set vs New Large Sample