Model Overfitting

Overfitting happens when a model learns the training data too precisely, including its noise and peculiarities, resulting in poor generalization to new, unseen data.

We assume that model performance on the train data generalizes to new data. As such, performance scores on the train and hold out data in the test harness should have the same distribution.

Significantly better model performance on train data compared to performance on hold out data (e.g. validation data or test folds) may indicate overfitting.

Is there evidence that the chosen model has overfit the train data compared to hold out data in the test harness?

Performance Distribution Tests

Are you able to evaluate the model using repeated resampling (e.g. k-fold cross-validation)?

Evaluate the model and gather performance scores on train and test folds.

  1. Are performance scores on train and test folds highly correlated (e.g. Pearson’s or Spearman’s)?
  2. Do performance scores on train and test folds have the same central tendencies (e.g. paired t-test or Wilcoxon signed-rank test)?
  3. Do performance scores on train and test folds have the same distributions (e.g. Kolmogorov-Smirnov test or Anderson-Darling test)?
  4. Do performance score points on a scatter plot fall on the expected diagonal line?

Note: These tests have a lot in common with Model Performance Tests, except that here they are applied within the test harness only.
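For example, the four checks could be scripted as follows. This is a minimal sketch, assuming a scikit-learn-style workflow; the dataset, model, and accuracy metric are illustrative stand-ins.

```python
# Minimal sketch: compare train-fold and test-fold scores from repeated k-fold CV.
# The dataset, model, and metric are illustrative assumptions.
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
model = DecisionTreeClassifier(max_depth=5, random_state=1)

# Repeated k-fold cross-validation, keeping both train-fold and test-fold scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = cross_validate(model, X, y, cv=cv, scoring="accuracy",
                         return_train_score=True)
train_scores = results["train_score"]
test_scores = results["test_score"]

# 1. Correlation between train-fold and test-fold scores
print("Pearson:", stats.pearsonr(train_scores, test_scores))
print("Spearman:", stats.spearmanr(train_scores, test_scores))

# 2. Same central tendency (paired test across folds)
print("Wilcoxon:", stats.wilcoxon(train_scores, test_scores))

# 3. Same distribution
print("Kolmogorov-Smirnov:", stats.ks_2samp(train_scores, test_scores))

# 4. Scatter plot of train-fold vs test-fold scores (expect points near the diagonal)
plt.scatter(train_scores, test_scores)
plt.xlabel("train fold accuracy")
plt.ylabel("test fold accuracy")
plt.show()
```

Strong correlation, similar central tendencies, similar distributions, and points falling near the diagonal are all consistent with the model not overfitting the train folds.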

Loss Curve Tests

Are you able to evaluate the model after each training update (aka training epoch, model update, etc.)?

Split the training dataset into train/validation sets and evaluate the model loss on each set after each model update.

  1. Does a line plot of model loss show signs of overfitting (improving or flat loss on the train set, and improving then worsening loss on the validation set, i.e. the so-called “U” shape of the validation loss curve)?
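A minimal sketch of producing such a loss plot, assuming a model that supports incremental updates; the network, data split, and epoch count are illustrative assumptions.

```python
# Minimal sketch: record train and validation loss after each model update.
# The model, split, and number of epochs are illustrative assumptions.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=1)

model = MLPClassifier(hidden_layer_sizes=(50,), random_state=1)
classes = np.unique(y_train)
train_loss, val_loss = [], []
for epoch in range(200):
    # One incremental update, then record loss on both sets
    model.partial_fit(X_train, y_train, classes=classes)
    train_loss.append(log_loss(y_train, model.predict_proba(X_train)))
    val_loss.append(log_loss(y_val, model.predict_proba(X_val)))

plt.plot(train_loss, label="train")
plt.plot(val_loss, label="validation")
plt.xlabel("epoch")
plt.ylabel("log loss")
plt.legend()
plt.show()
```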

Validation Curve Tests

Are you able to vary model capacity?

Vary model capacity, evaluate the model at each configuration value, and record train and hold out performance.

Note where the currently chosen configuration (if any) falls on the plot.

  1. Does a line plot of capacity vs performance scores show signs of overfitting?
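A minimal sketch using scikit-learn's validation_curve, with tree depth standing in for "capacity"; the dataset, metric, and marked configuration are illustrative assumptions.

```python
# Minimal sketch: vary model capacity and compare train vs hold out scores.
# Tree depth is an illustrative capacity parameter.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=1)
depths = np.arange(1, 21)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=1), X, y,
    param_name="max_depth", param_range=depths, cv=10, scoring="accuracy")

plt.plot(depths, train_scores.mean(axis=1), label="train")
plt.plot(depths, test_scores.mean(axis=1), label="hold out")
# Mark the currently chosen configuration, if any (value is illustrative)
plt.axvline(x=5, linestyle="--", color="gray")
plt.xlabel("max_depth (model capacity)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

A widening gap between the train and hold out curves as capacity grows is the classic signature of overfitting.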

Learning Curve Tests

Vary the size of the training data and evaluate model performance.

Note whether the model scales, i.e. whether performance improves with more data.

  1. Does the line plot of train set size vs performance scores show signs of overfitting?
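A minimal sketch using scikit-learn's learning_curve; the dataset, model, and train-size grid are illustrative assumptions.

```python
# Minimal sketch: vary training set size and compare train vs hold out scores.
# The dataset, model, and size grid are illustrative assumptions.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, random_state=1)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=1), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="accuracy")

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, test_scores.mean(axis=1), label="hold out")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```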

Visualize The Fit

In some cases, we can visualize the domain and the model's fit within it (possible only in the simplest domains).

Visualizing the fit relative to the data can often reveal the condition of the fit, e.g. good fit, underfit, overfit.

  1. Visualize the decision surface for the model (classification only).
  2. Visualize a scatter plot for data and line plot for the fit (regression only).
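A minimal sketch of the first item, plotting the decision surface of an illustrative classifier on a 2D toy problem.

```python
# Minimal sketch: decision surface of a classifier on a 2D toy problem
# (only practical in very simple domains). Dataset and model are illustrative.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=1)
model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Predict over a grid covering the input space, then plot the class regions
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.show()
```

A very jagged or fragmented decision surface that traces individual points is a visual sign of overfitting, whereas an overly simple surface suggests underfitting.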

Overfitting Test Set

A model has overfit the test set if, when trained on the entire train set and evaluated on the test set, its performance is better than that of a new model fit on the whole dataset (train + test sets) and evaluated on new data (e.g. in production).

Is there evidence that the chosen model has overfit the test set?

This can happen if the test set is evaluated and that knowledge is used in the test harness in some way, such as when choosing a model, a model configuration, or a data preparation scheme.

This is called test set leakage and results in an optimistic estimate of model performance.
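As an illustration of the pattern, the sketch below chooses a configuration using the test set itself; the dataset, model, and tuned parameter are illustrative assumptions.

```python
# Sketch of the leakage pattern described above: a configuration is chosen
# using the test set, so the reported test score becomes optimistic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=1)

# Leaky protocol: max_depth is selected to maximize the test-set score...
best_depth, best_score = None, -np.inf
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    score = model.fit(X_train, y_train).score(X_test, y_test)
    if score > best_score:
        best_depth, best_score = depth, score

# ...so best_score is no longer an unbiased estimate of performance on new data.
print("depth chosen on the test set:", best_depth, "test accuracy:", best_score)
```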

See Data Leakage.