Performance Gap
⚠️ Warning: Do not report or inspect specific model performance scores, and remove all axes from plots. Otherwise you may learn details about what works well or poorly on the test set, which can influence model selection, introducing test set leakage and potential bias in the final model.
Quantify the Performance Gap
There is variance in the model (e.g. random seed, specific training data, etc.) and variance in its measured performance (e.g. test harness, specific test data, etc.).
Vary these elements and see whether the resulting score distributions overlap.
If they overlap, any gap may be statistical noise (i.e. not a real effect). If they do not, the gap may be a real issue that requires more digging.
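As a minimal sketch of this overlap check, assume you have already collected a sample of test harness scores (e.g. from repeated cross-validation) and a sample of test set scores (e.g. from bootstrapping the test set); the score arrays below are synthetic placeholders for illustration only:

```python
# Minimal sketch: decide whether two score distributions overlap.
# `harness_scores` and `test_scores` are placeholder samples; in practice
# they would come from your own test harness and test set evaluations.
import numpy as np

def interval(scores, low=2.5, high=97.5):
    """Return an empirical central interval for a sample of scores."""
    return np.percentile(scores, low), np.percentile(scores, high)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any region."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical score samples for illustration only.
rng = np.random.default_rng(1)
harness_scores = rng.normal(loc=0.85, scale=0.02, size=30)
test_scores = rng.normal(loc=0.82, scale=0.03, size=100)

harness_ci = interval(harness_scores)
test_ci = interval(test_scores)
print(f"harness 95% interval: {harness_ci[0]:.3f} to {harness_ci[1]:.3f}")
print(f"test    95% interval: {test_ci[0]:.3f} to {test_ci[1]:.3f}")
print("overlap:", intervals_overlap(harness_ci, test_ci))
```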
Is there evidence that the performance gap between the test harness and the test set warrants further investigation?
Checks
- Are you optimizing one (and only one) performance metric for the task?
- What are the specific performance scores on the test harness and the test set (e.g. min/max, mean/stdev, point estimate, etc.)?
- What is the generalization gap, calculated as the absolute difference in scores between the test harness and test set?
- Does the chosen model outperform a dummy/naive model (e.g. predict mean/median/mode, etc.) on the test harness and test set (i.e. does it have any skill)? See the baseline sketch after this checklist.
- Are you able to evaluate a diverse suite of many (e.g. 10+) untuned standard machine learning models on the training and test sets and calculate their generalization gaps (absolute differences) to establish a reference range? See the model-suite sketch after this checklist.
- Is the chosen model's performance gap within the reference range calculated for your test harness and test set (e.g. min/max, confidence interval, standard error, etc.)?
- Are you able to evaluate the final chosen model many times (e.g. 30+, 100-1000) on bootstrap samples of the test set to establish a test set performance distribution? See the bootstrap sketch after this checklist.
- Is the train set (test harness) performance within the test set distribution (e.g. min/max, confidence interval, etc.)?
- Are you able to develop multiple estimates of model performance on the training dataset (e.g. k-fold cross-validation, repeated train/test splits, bootstrap, checkpoint models, etc.)? See the repeated cross-validation sketch after this checklist.
- Is the test set performance within the train set (test harness) performance distribution (e.g. min/max, confidence interval, etc.)?
- Are you able to fit and evaluate multiple different versions of the chosen model on the train set (e.g. with different random seeds)? See the random-seed sketch after this checklist.
- Is the test set performance within the train set (test harness) performance distribution (e.g. min/max, confidence interval, etc.)?
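Baseline sketch. A minimal sketch of the dummy/naive model check, comparing the chosen model against a naive baseline on both the test harness (cross-validation) and the hold-out test set. The synthetic dataset, the random forest, and the accuracy metric are placeholders, not recommendations:

```python
# Minimal sketch: does the chosen model have any skill over a naive baseline
# on both the test harness (cross-validation) and the hold-out test set?
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data and models for illustration only.
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = RandomForestClassifier(random_state=1)
dummy = DummyClassifier(strategy="most_frequent")

for name, est in [("model", model), ("dummy", dummy)]:
    cv_score = cross_val_score(est, X_train, y_train, cv=10).mean()
    test_score = est.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: harness={cv_score:.3f} test={test_score:.3f}")
```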
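Model-suite sketch. A minimal sketch of the reference-range check, fitting a suite of untuned standard models and recording each one's generalization gap (absolute difference between harness and test scores). The data and the specific suite of models are placeholders:

```python
# Minimal sketch: establish a reference range for the generalization gap
# from a suite of untuned standard models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Placeholder data and model suite for illustration only.
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

suite = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    GaussianNB(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    SVC(),
]

gaps = []
for est in suite:
    harness = cross_val_score(est, X_train, y_train, cv=10).mean()
    test = est.fit(X_train, y_train).score(X_test, y_test)
    gaps.append(abs(harness - test))

print(f"reference gap range: min={min(gaps):.3f} max={max(gaps):.3f}")
print(f"mean={np.mean(gaps):.3f} stdev={np.std(gaps):.3f}")
# Compare the chosen model's gap against this reference range.
```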
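Bootstrap sketch. A minimal sketch of the bootstrap check, scoring the final model's test set predictions on bootstrap samples of the test set and asking whether the test harness score falls inside the resulting interval. Again, the data, model, and metric are placeholders:

```python
# Minimal sketch: bootstrap the test set to get a distribution of test-set
# scores for the final chosen model, then check whether the harness score
# falls inside that distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data and model for illustration only.
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = RandomForestClassifier(random_state=1)
harness_score = cross_val_score(model, X_train, y_train, cv=10).mean()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rng = np.random.default_rng(1)
n = len(y_test)
scores = []
for _ in range(1000):
    idx = rng.integers(0, n, n)  # sample test rows with replacement
    scores.append(accuracy_score(y_test[idx], y_pred[idx]))

low, high = np.percentile(scores, [2.5, 97.5])
print(f"test 95% interval: {low:.3f} to {high:.3f}")
print(f"harness score {harness_score:.3f} inside interval: {low <= harness_score <= high}")
```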
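Repeated cross-validation sketch. A minimal sketch of building a distribution of test harness scores with repeated k-fold cross-validation and asking whether the single test set score falls inside it. The data, model, and fold/repeat counts are placeholders:

```python
# Minimal sketch: is the test set score inside the distribution of test
# harness scores from repeated k-fold cross-validation?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, train_test_split

# Placeholder data and model for illustration only.
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = RandomForestClassifier(random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
harness_scores = cross_val_score(model, X_train, y_train, cv=cv)

test_score = model.fit(X_train, y_train).score(X_test, y_test)
low, high = np.percentile(harness_scores, [2.5, 97.5])
print(f"harness 95% interval: {low:.3f} to {high:.3f}")
print(f"test score {test_score:.3f} inside interval: {low <= test_score <= high}")
```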
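Random-seed sketch. A minimal sketch of the multiple-versions check, refitting the chosen model with different random seeds and asking whether the test set performance sits inside the resulting harness score distribution. The data, model, and number of seeds are placeholders:

```python
# Minimal sketch: refit the chosen model with different random seeds and
# compare the test set performance to the harness score distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data and model for illustration only.
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

harness_scores, test_scores = [], []
for seed in range(30):
    model = RandomForestClassifier(random_state=seed)
    harness_scores.append(cross_val_score(model, X_train, y_train, cv=10).mean())
    test_scores.append(model.fit(X_train, y_train).score(X_test, y_test))

low, high = np.percentile(harness_scores, [2.5, 97.5])
mean_test = np.mean(test_scores)
print(f"harness 95% interval across seeds: {low:.3f} to {high:.3f}")
print(f"mean test score {mean_test:.3f} inside interval: {low <= mean_test <= high}")
```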