Model Performance
We assume that model performance on the train set and the test set generalizes to new data. As such, performance scores estimated on the train set and the test set should have the same distribution.
Is there evidence that model performance scores on the train and test sets have the same distribution?
Prerequisite
We require a single test harness that evaluates a suite of standard models on the train set and on the test set, producing multiple (>3) estimates of model performance on hold-out data.
- Choose a diverse suite (10+) of standard machine learning algorithms with good default hyperparameters that are appropriate for your predictive modeling task (e.g. DummyModel, SVM, KNN, DecisionTree, Linear, NeuralNet, etc.)
- Choose a standard test harness:
- Did you use a train/test split for model selection on the test harness?
- Perform multiple (e.g. 5 or 10) train/test splits on the train set and the test set and gather the two samples of performance scores.
- Did you use k-fold cross-validation for model selection (e.g. kfold, stratified kfold, repeated kfold, etc.) in your test harness?
- Apply the same cross-validation evaluation procedure to the train set and the test set and gather the two samples of performance scores.
- Is fitting the chosen model multiple times too expensive, so you used a single train/validation split on the train set?
- Evaluate the same chosen model on many (e.g. 30, 100, 1000) bootstrap samples of the validation set and the test set and gather the two samples of performance scores (see the bootstrap sketch after this list).
- Evaluate the suite of algorithms on the train set and test set using the same test harness.
- Gather the sample of hold-out performance scores on the train set and the test set for each algorithm (a minimal harness sketch follows this list).
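To make the harness concrete, below is a minimal sketch using scikit-learn, assuming a tabular classification task evaluated with repeated stratified k-fold cross-validation; the synthetic dataset, the particular algorithms, and the accuracy metric are placeholder assumptions to be swapped for your own data, suite, and metric.

```python
# A minimal harness sketch (placeholder data, models, and metric).
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic data stands in for your dataset and its existing train/test split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# A suite of standard algorithms with default hyperparameters.
models = {
    'dummy': DummyClassifier(strategy='most_frequent'),
    'linear': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'tree': DecisionTreeClassifier(random_state=1),
    'svm': SVC(),
    'mlp': MLPClassifier(max_iter=500, random_state=1),
}

# Apply the same repeated k-fold procedure to the train set and the test set,
# yielding one sample of hold-out scores per algorithm per dataset.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
train_scores, test_scores = {}, {}
for name, model in models.items():
    train_scores[name] = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    test_scores[name] = cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv, n_jobs=-1)
    print(f'{name}: train {train_scores[name].mean():.3f}, test {test_scores[name].mean():.3f}')
```

Each algorithm ends up with 30 hold-out scores per dataset (10 folds times 3 repeats), comfortably more than the 3+ estimates required above.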
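If refitting the chosen model is too expensive, the bootstrap variant might look like the sketch below; the GradientBoostingClassifier, the synthetic data, and the split sizes are placeholder assumptions, and the key idea is that one fitted model is scored on many bootstrap samples of the validation set and of the test set.

```python
# A minimal bootstrap sketch for a single expensive model (placeholder data and model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)

# Fit the chosen model once on the training portion only.
model = GradientBoostingClassifier(random_state=1).fit(X_fit, y_fit)

def bootstrap_scores(model, X, y, n_boot=100, seed=1):
    """Score the already-fitted model on n_boot bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))  # sample rows with replacement
        scores.append(accuracy_score(y[idx], model.predict(X[idx])))
    return np.array(scores)

# Two samples of performance scores: validation set vs test set.
val_scores = bootstrap_scores(model, X_val, y_val)
test_scores = bootstrap_scores(model, X_test, y_test)
print(f'val  {val_scores.mean():.3f} +/- {val_scores.std():.3f}')
print(f'test {test_scores.mean():.3f} +/- {test_scores.std():.3f}')
```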
Checks
We now have a sample of performance scores on the train set and another for the test set for each algorithm.
Next, compare the distributions of these performance scores (pooled across all algorithms, per algorithm, or both); a sketch of these checks follows the list below.
- Are the train and test performance scores highly correlated (e.g. Pearson’s or Spearman’s correlation)?
- Do the performance scores on train and test sets have the same central tendencies (e.g. t-Test or Mann-Whitney U Test)?
- Do the performance scores on train and test sets have the same distributions (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test)?
- Do the performance scores on train and test sets have the same variance (e.g. F-test, Levene’s test)?
- Do the performance scores on train and test sets have a low divergence in distributions (e.g. KL-divergence or JS-divergence)?
- Is the effect size of the difference in performance between the train and test scores small (e.g. Cohen’s d or Cliff’s delta)?
- Do the paired (train, test) performance scores fall along the expected diagonal line on a scatter plot?
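Below is a minimal sketch of these checks using SciPy, assuming `train_scores` and `test_scores` are two 1-D arrays of hold-out scores gathered from the same harness (placeholder arrays are generated here); the correlation check is illustrated separately on fabricated per-algorithm mean scores.

```python
# Placeholder score samples; in practice use the arrays gathered from the harness.
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
train_scores = rng.normal(0.85, 0.02, 30)
test_scores = rng.normal(0.85, 0.02, 30)

# Central tendency: do the two samples share a mean/median?
print('t-test            ', stats.ttest_ind(train_scores, test_scores))
print('Mann-Whitney U    ', stats.mannwhitneyu(train_scores, test_scores))

# Full distribution: could the samples come from the same distribution?
print('Kolmogorov-Smirnov', stats.ks_2samp(train_scores, test_scores))
print('Anderson-Darling  ', stats.anderson_ksamp([train_scores, test_scores]))

# Spread: do the samples have the same variance?
print("Levene's test     ", stats.levene(train_scores, test_scores))

# Divergence between histograms of the two samples (JS distance, 0 = identical).
bins = np.histogram_bin_edges(np.concatenate([train_scores, test_scores]), bins=10)
p, _ = np.histogram(train_scores, bins=bins, density=True)
q, _ = np.histogram(test_scores, bins=bins, density=True)
print('JS distance       ', jensenshannon(p, q))

# Effect size: Cohen's d for the difference in means.
pooled_std = np.sqrt((train_scores.var(ddof=1) + test_scores.var(ddof=1)) / 2.0)
print("Cohen's d         ", (train_scores.mean() - test_scores.mean()) / pooled_std)

# Correlation is best checked across algorithms: pair each algorithm's mean train
# score with its mean test score (fabricated placeholder means here).
mean_train = rng.normal(0.80, 0.05, 10)
mean_test = mean_train + rng.normal(0.0, 0.01, 10)
print("Pearson's r       ", stats.pearsonr(mean_train, mean_test))
```

Large p-values and small effect sizes are consistent with the two samples of scores coming from the same distribution; small p-values or large divergences are evidence that they do not.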
Extensions
The above checks compare the distribution of performance scores on the train set with that on the test set:
- Train Set vs Test Set
After that, consider repeating all tests for the combinations:
- Train Set vs Whole Dataset (Train Set + Test Set)
- Test Set vs Whole Dataset (Train Set + Test Set)
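As a rough sketch of these extended comparisons, the same distribution check can simply be repeated for each pairing; the dataset, the single placeholder algorithm, and the Kolmogorov-Smirnov test below are illustrative assumptions.

```python
# Repeat one of the checks across the train/test/whole-dataset pairings.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_all, y_all = np.vstack([X_train, X_test]), np.concatenate([y_train, y_test])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = LogisticRegression(max_iter=1000)
scores = {
    'train': cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv),
    'test': cross_val_score(model, X_test, y_test, scoring='accuracy', cv=cv),
    'whole': cross_val_score(model, X_all, y_all, scoring='accuracy', cv=cv),
}

# The same distribution check (here Kolmogorov-Smirnov) for each pairing of interest.
for a, b in [('train', 'test'), ('train', 'whole'), ('test', 'whole')]:
    res = stats.ks_2samp(scores[a], scores[b])
    print(f'{a} vs {b}: KS statistic={res.statistic:.3f}, p-value={res.pvalue:.3f}')
```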
In some domains, you may be able to collect an additional large sample of data from the domain against which the performance score distributions for the train set and the test set can be compared:
- Train Set vs New Large Sample
- Test Set vs New Large Sample