Generalization Gap

Problem Summary

Let’s define our terms and the problem at hand (also see the glossary).

  1. A dataset is collected from the domain and split into a train set used for model selection and a test set used for the evaluation of the chosen model.
  2. A test harness is used to evaluate many candidate models on the train set by estimating their generalization performance (e.g. a subsequent train/test split, k-fold cross-validation, etc.).
  3. A single model is (eventually) chosen using an estimate of its generalization performance on the test harness (here, “model” refers to the pipeline of data transforms, model architecture, model hyperparameters, calibration of predictions, etc.).
  4. The chosen model is then fit on the entire train set and evaluated on the test set to give a single unbiased point estimate of generalization performance.
  5. The two estimates do not match: 1) the test harness performance estimated on the train set differs from 2) the point estimate of the model fit on the train set and evaluated on the test set. Why? (A minimal sketch of this workflow follows the quote below.)

Unfortunately, when training the model, we don’t have access to the test set (by assumption), so we cannot use the test set to pick the model of the right complexity. However, we can create a test set by partitioning the training set into two: the part used for training the model, and a second part, called the validation set, used for selecting the model complexity. We then fit all the models on the training set, and evaluate their performance on the validation set, and pick the best. Once we have picked the best, we can refit it to all the available data. If we have a separate test set, we can evaluate performance on this, in order to estimate the accuracy of our method.

– Page 23, Machine Learning: A Probabilistic Perspective, Kevin Murphy, 2012.
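
The following is a minimal sketch of the workflow above, assuming scikit-learn and a synthetic dataset; the candidate models, split sizes, and variable names are illustrative only, not part of the original description:

# A minimal sketch of the model selection workflow described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# A synthetic stand-in for a dataset collected from the domain.
X, y = make_classification(n_samples=1000, random_state=1)

# 1. Split the dataset into a train set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 2./3. Test harness: estimate the generalization performance of each
# candidate model with k-fold cross-validation and pick the best.
candidates = [LogisticRegression(max_iter=1000), RandomForestClassifier()]
scores = [cross_val_score(m, X_train, y_train, cv=10).mean() for m in candidates]
best = candidates[scores.index(max(scores))]

# 4. Fit the chosen model on the entire train set, then evaluate it
# once on the held-back test set.
best.fit(X_train, y_train)
test_score = best.score(X_test, y_test)

# 5. Compare the two estimates; a large difference between them is the
# generalization gap discussed in this article.
print(f"harness estimate: {max(scores):.3f}, test estimate: {test_score:.3f}")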

Variations on this problem:

  1. Performance on the test harness and test set approximately match, but performance on a hold-out set is worse (sketched below).
  2. Performance on the test harness and test set approximately match, but performance on data in production is worse.
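
As a rough illustration of variation 1, one might hold back a third set up front and compare all three scores. This is a sketch under the same scikit-learn assumptions as above; the split proportions and names are arbitrary:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)

# Carve off a hold-out set first, then split the remainder into the
# usual train and test sets.
X_rest, X_hold, y_rest, y_hold = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

# After model selection and the final test set evaluation, score the
# chosen model once on the hold-out set; if the harness and test set
# scores match but this score is worse, we have variation 1.
# (`chosen_model` is hypothetical, standing in for the selected pipeline.)
# hold_score = chosen_model.score(X_hold, y_hold)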

Problem Name

I refer to this problem as the “Generalization Gap”, but it goes by many names, for example:

  • Generalization Error
  • Generalization Gap
  • Overfitting
  • Train-Test Gap

Technically, the generalization gap is a description of the problem, while overfitting is only one possible cause of it.

Overfitting is the central problem in machine learning.

– Page 71, The Master Algorithm, Pedro Domingos, 2015.

Problem Schematic

The following cartoon gives a sense of the structure of this problem:

graph TD
    A[Dataset]
    A --> C[**Train Set**]
    A --> D[**Test Set**]
    C --> E[Fit Model on **Train Set**]

    E --> H[**Train Set** Scores: GOOD]
    E --> I[**Test Set** Scores: BAD]
    D --> I

    H --> J{Why?}
    I --> J

    style H fill:#d4f8d4,stroke:#333,stroke-width:2px
    style I fill:#f8d4d4,stroke:#333,stroke-width:2px
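
To make the schematic concrete, here is a minimal sketch that reproduces the good-train/bad-test pattern, again assuming scikit-learn; the unconstrained decision tree is just one illustrative way to induce the gap:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, noisy synthetic dataset makes the gap easy to see.
X, y = make_classification(n_samples=200, n_informative=2, n_redundant=8,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# An unconstrained tree can memorize the train set, including its noise.
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print(f"train score: {model.score(X_train, y_train):.3f}")  # GOOD (near 1.0)
print(f"test score:  {model.score(X_test, y_test):.3f}")    # BAD (much lower)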

Further Reading