Challenge Results
Challenge the Performance Gap
Often the train set score on the test harness is a mean (e.g. via k-fold cross-validation) and the test set score is a point estimate.
Perhaps comparing these different types of quantities is the source of the problem?
We change the model evaluation scheme for the train set, the test set, or both so that we can make an apples-to-apples comparison of model performance on the two sets.
Is there statistical evidence that the chosen model has the same apples-to-apples performance on the train and test sets?
Prerequisite
We require a single test harness that can evaluate the chosen model on both the train set and the test set and that yields multiple (>3) estimates of model performance on hold-out data.
- Did you use a train/test split for model selection on the test harness?
- Perform multiple (e.g. 5 or 10) train/test splits on the train set and on the test set and gather the two samples of performance scores (see the resampling sketch after this list).
- Did you use k-fold cross-validation for model selection (e.g. kfold, stratified kfold, repeated kfold, etc.) in your test harness?
- Apply the same cross-validation procedure to the train set and to the test set and gather the two samples of performance scores (the resampling sketch after this list covers this option as well).
- Is fitting the chosen model multiple times too expensive, so you used a single train/validation split on the train set?
- Evaluate the already-fitted model on many (e.g. 30, 100, 1000) bootstrap samples of the validation set and the test set and gather the two samples of performance scores (see the bootstrap sketch after this list).
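Below is a minimal sketch of the resampling options, assuming a scikit-learn style estimator named model and arrays X_train, y_train, X_test, y_test that already exist; the accuracy metric and the split/fold counts are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, ShuffleSplit, cross_val_score

def score_sample(estimator, X, y, cv, scoring="accuracy"):
    """Collect a sample of hold-out scores by re-fitting and evaluating the
    estimator under the given resampling scheme."""
    return cross_val_score(estimator, X, y, cv=cv, scoring=scoring)

# Use whichever scheme matches your test harness:
# repeated train/test splits...
splits = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
# ...or (repeated) k-fold cross-validation.
kfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Apply the same scheme to both sets to get two comparable score samples.
train_scores = score_sample(model, X_train, y_train, cv=kfold)
test_scores = score_sample(model, X_test, y_test, cv=kfold)
```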
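For the cheap path, a sketch along these lines may help: the chosen model is fitted once and then scored on bootstrap resamples of the validation and test sets. The names fitted_model, X_val, y_val, X_test, y_test are assumptions, the data is assumed to be in numpy arrays, and the classification metric is illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_scores(fitted_model, X, y, n_boot=100, seed=1):
    """Score an already-fitted model on n_boot bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        scores.append(accuracy_score(y[idx], fitted_model.predict(X[idx])))
    return np.array(scores)

val_scores = bootstrap_scores(fitted_model, X_val, y_val, n_boot=100)
test_scores = bootstrap_scores(fitted_model, X_test, y_test, n_boot=100)
```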
Checks
We now have a sample of performance scores on the train set and another for the test set.
Next, compare the distributions of these performance scores (the sketches after the checklist below illustrate the corresponding tests).
- Are the train and test performance scores highly correlated (e.g. Pearson’s or Spearman’s correlation)?
- Do the performance scores on train and test sets have the same central tendencies (e.g. t-Test or Mann-Whitney U Test)?
- Do the performance scores on train and test sets have the same distributions (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test)?
- Do the performance scores on train and test sets have the same variance (e.g. F-test, Levene’s test)?
- Is the effect size of the difference between the train and test scores small (e.g. Cohen’s d)?
- Is the divergence between the train and test score distributions small (e.g. KL-divergence or JS-divergence)?
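A minimal sketch of the hypothesis-test checks, assuming train_scores and test_scores are equal-length numpy arrays gathered with the same resampling scheme as above; the correlation check is only meaningful when the scores are paired (e.g. fold-by-fold).

```python
from scipy.stats import pearsonr, ttest_ind, mannwhitneyu, ks_2samp, levene

# Correlation (requires paired scores, e.g. the same folds on both sets).
r, r_p = pearsonr(train_scores, test_scores)

# Same central tendency? (parametric and non-parametric)
t_stat, t_p = ttest_ind(train_scores, test_scores)
u_stat, u_p = mannwhitneyu(train_scores, test_scores)

# Same distribution?
ks_stat, ks_p = ks_2samp(train_scores, test_scores)

# Same variance?
lev_stat, lev_p = levene(train_scores, test_scores)

print(f"pearson r={r:.3f} (p={r_p:.3f}), t-test p={t_p:.3f}, "
      f"mann-whitney p={u_p:.3f}, ks p={ks_p:.3f}, levene p={lev_p:.3f}")
```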
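A sketch of the effect-size and divergence checks; the histogram binning used to estimate the divergence is an illustrative choice, and the same train_scores and test_scores arrays are assumed.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    pooled = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    return (np.mean(a) - np.mean(b)) / pooled

def js_divergence(a, b, bins=10):
    """Jensen-Shannon divergence between histograms of the two score samples."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    # scipy returns the JS distance (the square root of the divergence)
    return jensenshannon(p, q) ** 2

print(f"cohen's d={cohens_d(train_scores, test_scores):.3f}, "
      f"js divergence={js_divergence(train_scores, test_scores):.3f}")
```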